During the review of my recent work, the reviewers raised two particularly valuable comments, and I would like to continue the discussion here. I will start by expanding on one of these issues and hope to discuss it with you.
Whether to standardize data is a long-discussed topic in the model-building process in many fields. In the following discussion, I will focus on the field of property mass appraisal.
Typical scales of the variables:
Price: 1000-100000+ RMB Yuan/m2
Age: 0-100+ years
Bedroom: 1-6?
Decoration condition: 0-1
Ratio of Elevator: 0-1
Floor Area Ratio: 0-15+?
Green Ratio: 0-0.6+?
Distance to POIs: 0-2 km? or 0-10 km? or 0-2000 m?
.........
(Example source: Article Mass Appraisal Modeling of Real Estate in Urban Centers by G...)
As can be seen, because the variables are measured in different units, their numerical ranges differ greatly. (The distribution, or density estimate, of each variable is another important issue.)
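To make the scale differences concrete, here is a minimal sketch of z-score standardization using NumPy. The data are synthetic and only loosely follow the ranges listed above; the column choices and the helper name `zscore` are my own assumptions for illustration, not taken from the cited article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic columns whose ranges roughly follow the scales listed above
# (illustrative values only, not real appraisal data).
age = rng.uniform(0, 100, n)                     # years
bedrooms = rng.integers(1, 7, n).astype(float)   # 1 to 6
green_ratio = rng.uniform(0, 0.6, n)             # 0 to 0.6
dist_poi_m = rng.uniform(0, 2000, n)             # metres

X = np.column_stack([age, bedrooms, green_ratio, dist_poi_m])


def zscore(X):
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


Z = zscore(X)
raw_std = X.std(axis=0)   # spreads differ by orders of magnitude
z_mean = Z.mean(axis=0)   # approximately 0 for every column
z_std = Z.std(axis=0)     # approximately 1 for every column

print("raw std:", raw_std)
print("z mean:", z_mean.round(6), "z std:", z_std.round(6))
```

After the transformation every column has mean 0 and standard deviation 1, so the units (years, metres, ratios) no longer drive the numerical magnitude.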
Here are some points I considered.
1. For an ordinary linear regression model, standardizing the variables or not does not affect the fit of the model, such as the value of R2; only the scale of the coefficients changes.
2. If the coefficients of the same variable are to be compared between different models (i.e. Hedonic Model 1 vs Hedonic Model 2, or Hedonic Model 1 vs Tree-based Model 1), the same standardization method must be used in each model.
3. For models such as neural networks, PCA, and support vector machines, standardization is good practice and effectively a must for the data set. For models such as linear regression, logistic regression, and decision trees, standardization will not affect the results (assuming no regularization penalty is applied in the regressions).
4. On the other hand, standardization removes the original units of measurement, so it becomes harder to know what is actually being compared between different variables.
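Point 1 can be checked numerically. The following is a minimal sketch, assuming an ordinary least squares fit with an intercept and a made-up price equation (the coefficients, noise level, and the helper name `ols_r2` are hypothetical). It fits the same model on raw and on z-scored predictors and compares the results.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic predictors on very different scales (illustrative only).
age = rng.uniform(0, 100, n)       # years
dist_m = rng.uniform(0, 2000, n)   # metres

# Hypothetical price equation, invented purely for this demonstration.
price = 30000 - 150 * age - 5 * dist_m + rng.normal(0, 2000, n)


def ols_r2(X, y):
    """Fit OLS with an intercept; return the coefficients and R2."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centered = y - y.mean()
    r2 = 1 - (resid @ resid) / (centered @ centered)
    return beta, r2


X_raw = np.column_stack([age, dist_m])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

beta_raw, r2_raw = ols_r2(X_raw, price)
beta_std, r2_std = ols_r2(X_std, price)

print("R2 raw:", r2_raw, "R2 standardized:", r2_std)
print("raw slopes:", beta_raw[1:])
print("standardized slopes:", beta_std[1:])
```

The two R2 values agree up to floating-point error, while each standardized slope equals the raw slope multiplied by that predictor's standard deviation. This is also the trade-off in point 4: the standardized slopes are comparable in magnitude, but they no longer read as "price change per year" or "per metre".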
In addition, if you think standardization is needed, which software and corresponding functions or modules would you use or recommend?
Thank you in advance!