Scikit-learn and pandas-based scaling works well for small amounts of data; for large data that is already stored in a database, using RasgoQL or SQL is preferable. See https://towardsdatascience.com/three-techniques-for-scaling-features-for-machine-learning-a7bc063ecd69 for an overview of common scaling techniques.
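Here is a minimal sketch of the in-memory route with pandas and scikit-learn; the column names and values are made up for illustration, and for warehouse-scale data the SQL/RasgoQL route above avoids pulling everything into memory:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative data only: two columns on very different scales
df = pd.DataFrame({
    "longitude": [5664988.0, 5665120.0, 5664870.0],
    "depth_m": [5.0, 12.0, 8.5],
})

std = StandardScaler()      # (x - mean) / std  -> zero mean, unit variance
minmax = MinMaxScaler()     # (x - min) / (max - min) -> values in [0, 1]

df_std = pd.DataFrame(std.fit_transform(df), columns=df.columns)
df_minmax = pd.DataFrame(minmax.fit_transform(df), columns=df.columns)
print(df_std.round(3))
print(df_minmax.round(3))
```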
If you are talking about the effect of units, I think normalization can sometimes make the results harder to interpret. A common approach is to standardize each independent variable as (x − μ)/σ, which gives it zero mean and unit variance (min-max scaling is what maps values to the range 0 to 1). I have seen people convert the scaled values back at the end in order to interpret the results. However, for neural networks, especially ones with multiple layers, I am not sure what normalization means once the data passes through layers and changes dimension.
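As a small illustration of that point (with made-up numbers): z-scoring gives zero mean and unit variance rather than a 0-to-1 range, and keeping μ and σ lets you invert the transform at the end.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # one independent variable (illustrative)
mu, sigma = x.mean(), x.std()

z = (x - mu) / sigma                  # standardized: mean 0, std 1 (not 0..1)
x_back = z * sigma + mu               # invert at the end to interpret results

print(z.round(3), np.allclose(x, x_back))
```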
To sum up, I think how much scaling matters depends heavily on the application for which you are using the neural network.
Neural networks, when used for predictive analytics in data mining, use the variables in a dataframe to predict the dependent variable, with the data split into training and test portions, usually 80 and 20 percent respectively. Take an example where longitude, depth of a well, and fluoride level in a study area are the columns used for prediction. Longitudes in UTM are in the millions: borehole X is located at longitude 5664988, drilled to a depth of 5 m, and has a fluoride content of 0.72 mg/L. Now if there are 100 such rows of data and you are given a new row with only depth and longitude, you readily notice that the magnitude of the longitude values far exceeds that of the other variables; the longitudes will overwhelm the depths, and this will end up giving wrong answers. This calls for standardization of the data (often loosely called normalization), so that predictions are made with all variables on a comparable scale and no single variable dominates simply because of its units.
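A small sketch of that borehole example, with randomly generated values standing in for real survey data, showing the 80/20 split and a scaler fitted on the training portion only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "longitude_utm": rng.uniform(5_600_000, 5_700_000, 100),  # values in the millions
    "depth_m": rng.uniform(3, 60, 100),                       # values in the tens
    "fluoride_mg_l": rng.uniform(0.1, 1.5, 100),              # target, roughly 0-1.5
})

X = df[["longitude_utm", "depth_m"]]
y = df["fluoride_mg_l"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # learn mean/std from the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # the same parameters are applied to the test set
print(X_train_s.std(axis=0).round(2))    # both columns now sit on a comparable scale
```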
This may involve computing the mean of every column variable and its standard deviation, then using these two parameters to standardize; or using min-max scaling, where each column's minimum is subtracted from its values and the result divided by the range (maximum minus minimum), so that the scaled values lie between 0 and 1. The values so scaled are the ones used for error back-propagation until a reasonable, acceptable error range is attained, at which point inverse scaling is performed to recover the predicted value in its original units. One would typically do this in software such as R or Python, since the calculation may run for over 5000 iterations of back-propagation with, say, gradient descent.
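A sketch of both options just described, plus the inverse-scaling step applied to a hypothetical network output so it comes back in original units (the depth values are illustrative):

```python
import numpy as np

depth = np.array([5.0, 12.0, 30.0, 55.0])   # illustrative column of well depths in metres

# Option 1: z-score using the column mean and standard deviation
mu, sigma = depth.mean(), depth.std()
depth_z = (depth - mu) / sigma

# Option 2: min-max scaling, so every value falls between 0 and 1
lo, hi = depth.min(), depth.max()
depth_01 = (depth - lo) / (hi - lo)

# After training (e.g. thousands of gradient-descent iterations), a prediction
# made in scaled space is inverse-scaled so it can be interpreted:
pred_scaled = 0.4                             # hypothetical network output
pred_original = pred_scaled * (hi - lo) + lo  # back to metres
print(depth_z.round(2), depth_01.round(2), pred_original)
```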
The model's weights and biases are used to judge the model's convergence, the point at which the model may be taken to have produced an appropriate answer. Several good examples to that effect are available on YouTube, as linked hereunder: