Does standardization or normalization of categorical features have any statistical validity when fitting machine learning models?

Dinesh Reddy Sagam

Standardization or normalization of categorical features is only valid if the categories represent a meaningful, ordered quantity. For nominal variables, it’s better to use encodings that respect their non-numeric nature; forcing numeric scaling otherwise has no statistical justification and may harm model performance.

Lotfali Bolboli

No, you should not standardize or normalize your numerically encoded categorical features. This practice is methodologically inappropriate and statistically invalid.

Core Reasons:

Violation of Measurement Principles: Standardization (Z-score normalization) and normalization (Min-Max scaling) are designed for continuous numerical data where magnitude and order are meaningful. Applying them to numerically encoded nominal categories (e.g., 1=USA, 2=Canada, 3=Mexico) imposes a false mathematical relationship and distance between categories.

Distortion of Feature Relationships: This process creates artificial numerical distances that can mislead algorithms, particularly distance-based ones.

No Benefit for Tree-Based Models: Algorithms like Decision Trees, Random Forests, and Gradient Boosting are invariant to monotonic transformations of features. They only care about split points, so standardizing provides no performance benefit.

Correct Approach:

For Distance-Based Algorithms (e.g., KNN, SVM, K-Means) and Linear Models: The correct method is to use One-Hot Encoding (OHE). OHE creates new binary columns for each category, representing them in a vector space without imposing an artificial order or magnitude.

For Tree-Based Models: You can often use label encoding or ordinal encoding directly, as these models do not rely on feature distance.
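The point about artificial distances can be checked with a small sketch. It reuses the 1=USA, 2=Canada, 3=Mexico codes from the answer above; the helper names (`euclidean`, `pairwise`) are made up for illustration, and only the Python standard library is assumed.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

categories = ["USA", "Canada", "Mexico"]

# Label encoding: arbitrary integer codes (1=USA, 2=Canada, 3=Mexico).
label_codes = {name: [float(i + 1)] for i, name in enumerate(categories)}

# One-hot encoding: one binary column per category.
one_hot = {
    name: [1.0 if j == i else 0.0 for j in range(len(categories))]
    for i, name in enumerate(categories)
}

def pairwise(encoding):
    """Distances between every pair of distinct categories."""
    return {
        (a, b): euclidean(encoding[a], encoding[b])
        for i, a in enumerate(categories)
        for b in categories[i + 1:]
    }

label_dist = pairwise(label_codes)    # distances depend on the arbitrary codes
one_hot_dist = pairwise(one_hot)      # every pair is equally far apart

for pair in label_dist:
    print(pair, label_dist[pair], round(one_hot_dist[pair], 4))
```

With label codes, USA ends up twice as far from Mexico as from Canada purely because of the numbering. With one-hot vectors, every pair of distinct categories is the same distance apart, and that distance stays the same however many categories (and therefore zero columns) there are.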

Sadia Tariq

If you have two labels for a feature, say yes/no, and you encode them as 0/1, is there any need for standardization? It doesn't make sense to me.

Sourav Roy

Dinesh Reddy Sagam & Lotfali Bolboli Thank you for your invaluable suggestions. I completely agree with you and think this is the best approach. Could you kindly clarify one more question? In my study, I am using eight machine learning classifiers, tree-based (DT, RF, GBM, XGBM) and distance-based (KNN, SVM). All the features in my dataset are categorical. Is it acceptable to apply the tree-based methods to the data directly and to use one-hot encoding only for the distance-based methods?

In my view, a smarter encoding, which you might call an embedding, is needed when the text data is longer and carries dependencies and meaning.

Sadia Tariq A categorical feature follows a binomial distribution before standardization, and a numerically encoded categorical feature follows N(0, 1) after standardization. Standardization puts numeric features on a comparable scale. For example, take Income ($4000, $3500, $5000, ...) and Age (28, 30, 27, ...): standardizing Income and Age converts them to the same scale. Could you please explain why it doesn't make sense to you?

Sadia Tariq Yes, I agree with you that numerical conversion is acceptable. However, I am concerned about the conversion from binomial to N(0, 1).
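A short sketch may help separate the two cases discussed here. It standardizes the Income and Age values quoted above (only the three values shown for each, used as a toy sample) and a hypothetical yes/no feature coded 0/1. The `standardize` helper and the `answers` list are assumptions made for illustration. The z-score fixes the mean at 0 and the standard deviation at 1, but it is a linear rescaling, so it does not change the shape of the distribution: a two-valued feature stays two-valued and does not become normally distributed.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Numeric features on very different scales (values quoted in the thread).
income = [4000.0, 3500.0, 5000.0]
age = [28.0, 30.0, 27.0]
income_z = standardize(income)
age_z = standardize(age)

# Hypothetical yes/no feature encoded as 0/1.
answers = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
answers_z = standardize(answers)

# Still only two distinct values after standardization.
distinct_values = sorted(set(round(z, 6) for z in answers_z))

print("income z-scores:", [round(z, 3) for z in income_z])
print("age z-scores:   ", [round(z, 3) for z in age_z])
print("binary z-scores take only these values:", distinct_values)
```

For Income and Age the rescaling is useful because both features end up on a comparable scale. For the 0/1 feature it only relabels the two categories with two other numbers.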

Can you tell me the distance between 0 and 1 in this case? With one-hot encoding, no matter how many zeros you have, would you need to "normalize" the distance?

Sadia Tariq I am new to machine learning and currently reading some books. My concern was clearly explained by Dinesh Reddy Sagam and Lotfali Bolboli, and their explanation is consistent with the methods described in the books. However, I am now completely confused by what you wanted to explain to me. Could you kindly explain it again using an example? I would highly appreciate any suggestions in this context.

Nahom Belete

Standardizing numerically encoded categorical features generally has no statistical validity because the numeric codes themselves do not represent true quantitative relationships, e.g., “Illiterate = 1” and “Primary = 2” do not imply that Primary is twice Illiterate. Standardization assumes meaningful numeric distances, which categorical variables lack. For tree-based models (DT, RF, GBM, XGBoost), neither encoding order nor scaling matters since they split on categories, not numeric magnitude. For distance-based models (KNN, SVM with RBF), one-hot encoding is preferred to avoid introducing artificial ordinal relationships. In short: do not standardize categorical variables; instead, use proper encoding (one-hot or target encoding) when needed.
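The claim that scaling does not matter for tree-based models can be illustrated with a toy decision stump, meaning a single threshold split chosen by misclassification count. This is a minimal sketch, not how any particular library builds its trees; the data and the `best_stump_partition` helper are hypothetical. One caveat: the sketch covers only scaling. In implementations that split on numeric thresholds, changing the order of the integer codes can still change which groupings of categories a single split can reach.

```python
import statistics

def best_stump_partition(x, y):
    """Return (left_rows, right_rows) for the threshold split with the fewest errors."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [i for i, v in enumerate(x) if v <= threshold]
        right = [i for i, v in enumerate(x) if v > threshold]
        errors = 0
        for side in (left, right):
            labels = [y[i] for i in side]
            majority = max(set(labels), key=labels.count)
            errors += sum(1 for label in labels if label != majority)
        # Strict '<' keeps the first (lowest) threshold when errors tie.
        if best is None or errors < best[0]:
            best = (errors, left, right)
    return best[1], best[2]

# Hypothetical integer-coded feature and binary target.
codes = [1, 2, 3, 1, 2, 3, 3, 1]
target = [0, 0, 1, 0, 1, 1, 1, 0]

# Standardize the codes (a monotonic, order-preserving transformation).
mean = statistics.fmean(codes)
sd = statistics.pstdev(codes)
scaled = [(c - mean) / sd for c in codes]

raw_partition = best_stump_partition(codes, target)
scaled_partition = best_stump_partition(scaled, target)

print("same rows on each side:", raw_partition == scaled_partition)
```

Because standardization preserves the ordering of the values, every candidate threshold separates the same rows before and after scaling, so the chosen split groups the data identically.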

Exactly, Nahom Belete. I strongly agree with you.