When I use a machine learning algorithm to make regression predictions, why do I find that there is a constant difference between the predicted value and the actual value? Could you tell me what may cause this phenomenon? Is it a model problem or a data problem?
If you are using some flavour of RNN, this happens because the network overfits and simply outputs the value of the previous timestep (and the offset looks constant because it is hard to distinguish point k from point k+1 in a time series of thousands of points). In that case it is likely both a model and a data problem (the model overfits and there is not enough data).
Otherwise, if you are trying to predict point estimates, I would also suspect a model problem, but without more details (what you are trying to predict, what architecture, how much data, etc.) it is really hard to tell. A quick check is to compare your model against a naive persistence baseline, as sketched below.
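A minimal sketch of that comparison, assuming a univariate time series and NumPy (the function name `lag1_baseline_rmse` and the synthetic data are placeholders, not anything from your setup): if your model's error is about the same as simply copying the previous value, it has probably learned little beyond persistence.

```python
import numpy as np

def lag1_baseline_rmse(y_true, y_pred):
    """RMSE of the model vs. RMSE of simply predicting the previous value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    model_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    # Persistence baseline: the prediction at step t is the actual value at t-1.
    persistence_rmse = np.sqrt(np.mean((y_true[1:] - y_true[:-1]) ** 2))
    return model_rmse, persistence_rmse

# Synthetic example (replace with your own series and model predictions):
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))        # a random-walk-like series
shifted_preds = np.r_[series[0], series[:-1]]   # a "model" that copies t-1
print(lag1_baseline_rmse(series, shifted_preds))  # two nearly equal RMSEs
```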
The difference between the predicted regression value and the actual value is called the residual. One of the main assumptions of regression analysis is that the residuals are normally distributed with mean 0, i.e., residuals must be both positive and negative. If this condition is not met (the residuals are only positive or only negative, i.e., the model consistently over-predicts or under-predicts), the regression model is poorly chosen for prediction, even though it might reasonably fit the data.
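A minimal sketch of checking these two conditions, assuming NumPy and SciPy are available (the helper name `residual_diagnostics` is hypothetical): the mean residual should be near 0, roughly half the residuals should be positive, and a normality test gives a rough idea of their distribution.

```python
import numpy as np
from scipy import stats

def residual_diagnostics(y_true, y_pred):
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)

    mean_resid = residuals.mean()            # should be close to 0
    frac_positive = (residuals > 0).mean()   # should be close to 0.5
    # Shapiro-Wilk test for normality of the residuals.
    _, normality_p = stats.shapiro(residuals)

    return {"mean_residual": mean_resid,
            "fraction_positive": frac_positive,
            "normality_p_value": normality_p}
```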
The reasons for poor prediction might be that (i) the regression model needs non-linear terms rather than being purely linear, (ii) there is multicollinearity among the independent variables, (iii) important predictive variables (features) are missing, i.e., the problem and the data are poorly understood, (iv) the data variance is non-constant, or (v) the model is used for extrapolation (prediction) far beyond the range of the independent variables used to estimate its parameters.
Overall, your focus should be on ensuring that the model meets the assumptions of regression analysis (a model problem) rather than on the machine learning technology.
You said that "...there is a constant between the predicted value and the actual value," but you cannot actually mean that the difference is always a known constant. That would mean that you know all dependent variable values with a zero estimated residual; that is, for that model the "irreducible error," sigma, is zero. So every predicted-y value would be associated with e = 0, so y = predicted-y. Or, if you have not yet accounted for the constant c, whether always negative or always positive,
y = predicted-y - c.
Perhaps you can clarify what you meant to say by providing an example.
In general,
y = predicted-y + e,
where e is a random variable, often heteroscedastic, but still with a random factor.
Perhaps you meant
y = predicted-y + c + e.
Considering that if there were an intercept term it would be part of predicted-y, what you would mean here is that there is a model bias. That is, we do not have model-unbiasedness, so the expected sum of the estimated residuals is not 0 but c. That would be a model problem. Also, sigma is still zero, which also appears to be a model problem. But I am not certain that that was your question.
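If the discrepancy really is a roughly constant offset c, a quick sanity check is to estimate it as the mean residual on held-out data and see whether subtracting it removes the gap. A minimal sketch assuming NumPy (the helper name `estimate_constant_bias` and the arrays are placeholders):

```python
import numpy as np

def estimate_constant_bias(y_true, y_pred):
    # Residuals on held-out data; their mean is an estimate of the constant c.
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    c_hat = residuals.mean()
    corrected = np.asarray(y_pred, dtype=float) + c_hat
    return c_hat, corrected

# If c_hat is far from zero and the corrected predictions fit much better,
# the model is biased (a model problem) rather than the data being at fault.
```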
James R Knaub, Alexander Kolker, Ioannis Kouroudis, Jamie Wallis: thank you all. With your help, I now have a general understanding of this problem, and I will consider these issues in my practical work.