I am working on a project on predicting future speeds for street segments.
Imagine we have a dataset (which is not normally distributed) like this:
(Time-date | Speed | Density | Day of Week | Hour | Label)
(2018-12-27 12:30 AM | 78.5 | 32 | 0 | 12 | 56.43)
This row says that a car was driving at a speed of 78.5 on Saturday (`Day of Week` column), 2018-12-27, at 12 o'clock, and that the speed half an hour later (`Label` column) is 56.43.
As you might have guessed, the task is to predict the speed half an hour ahead. The problem is that speed and label are very highly correlated, and this causes trouble when building a model.
When I plot the predictions against the test-set labels, I get `pic1.png`. The odd part for me is what happens when I plot the predictions against the current speed instead: as you can see in `pic2.png`, the points lie closer to the bisector (the diagonal where prediction equals speed), which implies that the model is predicting the current speed rather than the label.
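To put a number on what the two plots show, here is a minimal sketch of the check I have in mind. It compares the mean absolute error of the predictions against the labels, against the current speed, and against the naive baseline "label = current speed". The data below is synthetic and only stands in for my real test set, and the names `mae` and `persistence_check` are hypothetical helpers, not part of any library.

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two equal-length arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))

def persistence_check(speed, label, pred):
    """Compare model predictions with the naive 'label = current speed' baseline."""
    return {
        "model_vs_label": mae(pred, label),      # the error we actually care about
        "baseline_vs_label": mae(speed, label),  # error of simply repeating the current speed
        "model_vs_speed": mae(pred, speed),      # how closely the model copies its input
    }

# Synthetic stand-in data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
speed = rng.uniform(20, 100, size=1000)
label = speed + rng.normal(0, 8, size=1000)  # future speed drifts away from current speed
pred = speed + rng.normal(0, 1, size=1000)   # a model that mostly echoes the current speed

report = persistence_check(speed, label, pred)
print(report)
```

If `model_vs_speed` is much smaller than `model_vs_label`, and `model_vs_label` is no better than `baseline_vs_label`, that would match what I see in `pic2.png`: the model is essentially echoing the input speed.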
I normalized the data and tried an SVM regressor, a neural-network regressor, Random Forest regression, XGBoost, and others, but nothing works!
It would be great if someone could suggest a solution to this problem, with the reasoning explained in detail.
Links to useful resources would also be appreciated.
Thank you guys