I am working on a project on predicting future speeds for street segments.

Suppose we have a dataset (which is not normally distributed) like this:

(Time-date | Speed | Density | Day of Week | Hour | Label)

(2018-12-27 12:30 AM | 78.5 | 32 | 0 | 12 | 56.43)

This row says that a car was driving at a speed of 78.5 on Saturday (`Day of Week` column), 2018-12-27, at 12 o'clock, and that its speed half an hour later (`Label` column) is 56.43.

As you might have guessed, the task is to predict the speed half an hour ahead. The problem is that speed and label are very highly correlated, and this causes trouble when building a model.
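To make the correlation issue concrete, here is a minimal sketch of how it could be measured, together with the error of a naive "persistence" baseline that simply reuses the current speed as the forecast. The data below is synthetic and purely an assumption for illustration (the label is the speed plus Gaussian noise); it is not my real dataset, and the helper names `pearson` and `mae` are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mae(a, b):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic stand-in data (assumption): the label is the current speed
# plus noise, which mimics a highly correlated speed/label pair.
random.seed(0)
speed = [random.uniform(20, 100) for _ in range(500)]
label = [s + random.gauss(0, 5) for s in speed]

r = pearson(speed, label)
# Persistence baseline: forecast "speed in 30 minutes" as the current speed.
persistence_mae = mae(speed, label)

print(f"correlation(speed, label) = {r:.3f}")
print(f"persistence baseline MAE  = {persistence_mae:.2f}")
```

If a trained model's test error is not clearly below the persistence baseline's error, the model has effectively learned nothing beyond copying the `Speed` column.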

When I plot the predictions against the test-set labels, I get `pic1.png`. The odd part is that when I plot the predictions against speed, as you can see in `pic2.png`, the points lie closer to the bisector (the line y = x), which implies that the model is predicting the current speed rather than the label.
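The visual impression from the two plots could also be checked numerically. The sketch below compares how far the predictions are from the labels with how far they are from the input speed. The function name `tracks_speed_report` is hypothetical, and apart from the single example row (78.5 and 56.43) every number is made up for illustration.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def tracks_speed_report(pred, speed, label):
    """Compare prediction error against the label and against the input speed.

    If the predictions are closer to `speed` than to `label`, the model is
    mostly echoing its input feature instead of forecasting the target.
    """
    mae_vs_label = mean_abs_diff(pred, label)
    mae_vs_speed = mean_abs_diff(pred, speed)
    return {
        "mae_vs_label": mae_vs_label,
        "mae_vs_speed": mae_vs_speed,
        "tracks_speed": mae_vs_speed < mae_vs_label,
    }


# Illustrative values only (assumption): the first speed/label pair is the
# example row from the dataset, the remaining numbers are invented.
speed = [78.5, 61.0, 44.0]
label = [56.43, 70.0, 50.0]
pred = [78.0, 60.5, 45.0]  # hypothetical model output that hugs the speed

report = tracks_speed_report(pred, speed, label)
print(report)
```

A result with `tracks_speed` set to `True` on the test set would confirm numerically what `pic2.png` suggests.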

I have normalized the data and tried an SVM regressor, a neural-network regressor, Random Forest regression, XGBoost, and others, but nothing works!

It would be great if someone could suggest a solution to this problem, with detailed reasoning.

Useful links would also be appreciated.

Thank you guys
