I am working on a traffic data analysis project: a routing app for vehicles that takes traffic data into account.
Imagine that each street has a data set gathered by users, with the following columns:
exact date and time | density of traffic | speed of traffic | weather | next_half_an_hour_speed (label)
We can derive "day of week", "hour", and "quarter of the hour" from the date column and add them as features. A record is inserted roughly every quarter of an hour.
For example:
2019-01-03 00:12:23 | 20 | 61.54 | Rainy | 23.12
2019-01-03 00:26:28 | 54 | 31.13 | Rainy | 40.12
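The feature derivation described above can be sketched roughly as follows. This is only a minimal illustration using the standard library; the function name `derive_time_features` and the feature keys are assumptions made for the example, not part of the actual project.

```python
from datetime import datetime

def derive_time_features(timestamp_text):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' string (hypothetical helper)."""
    ts = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return {
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "hour": ts.hour,                  # 0 ... 23
        "hour_quarter": ts.minute // 15,  # 0 ... 3
    }

# The two timestamps from the sample rows above.
samples = ["2019-01-03 00:12:23", "2019-01-03 00:26:28"]
features = [derive_time_features(s) for s in samples]
for s, f in zip(samples, features):
    print(s, f)
```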
Now I would like to ask for your suggestions on the following problem.
In the uploaded file "1.png" you can see a histogram of the label (next half-hour speed). The second file, "2.png", shows the predictions of the next half-hour speed (made with sklearn.svm.SVR in Python) plotted against the test values.
Based on these observations, I have come to the conclusion that the predictions in the range [22, 55] are so poor because my samples are imbalanced in the speed feature over that range. Speed has the strongest linear correlation with the label, and its histogram has the same shape as the label's.
Do you think this conclusion is correct?
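One way to check this numerically rather than by eye would be to compare the sample count and the mean absolute error per label range. The sketch below is only an illustration: the helper `mae_by_range` is hypothetical, the numbers are made up, and the real test labels and SVR predictions would have to be substituted. Note that the ranges are half-open, so the middle one is [22, 55).

```python
from collections import defaultdict

def mae_by_range(y_true, y_pred, edges):
    """Return {(lo, hi): (count, mean absolute error)} for each label range [lo, hi)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= t < hi:
                sums[(lo, hi)] += abs(t - p)
                counts[(lo, hi)] += 1
                break
    return {k: (counts[k], sums[k] / counts[k]) for k in counts}

# Made-up values for illustration only; replace with the real test
# labels and the SVR predictions.
y_test = [10.0, 23.12, 40.12, 50.0, 60.0, 65.0]
y_pred = [12.0, 45.0, 58.0, 62.0, 61.0, 64.0]
edges = [0, 22, 55, 200]  # the middle range is the one in question

report = mae_by_range(y_test, y_pred, edges)
for (lo, hi), (n, mae) in sorted(report.items()):
    print(f"[{lo}, {hi}): n={n}, MAE={mae:.2f}")
```

If the error in the sparse range is clearly larger than elsewhere while its sample count is clearly smaller, that would be consistent with the imbalance explanation, although it would not rule out other causes.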
I would also appreciate pointers to a good resource or tool (something like SMOTE) for this problem.
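As a possible alternative to resampling, samples from rare label ranges could be given larger weights during training. Note that this is a different technique from SMOTE, and it is not something that has been validated on this data. The sketch below only computes inverse-frequency weights; the helper name `inverse_frequency_weights`, the bin width of 10, and the labels are all assumptions made for illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels, bin_width=10.0):
    """Weight each sample by 1 / (size of its label bin), rescaled so the mean weight is 1."""
    bins = [int(y // bin_width) for y in labels]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Made-up labels: two samples in sparse ranges, six in a dense range.
labels = [23.12, 40.12, 61.0, 62.5, 63.0, 64.2, 65.9, 66.1]
weights = inverse_frequency_weights(labels)
for y, w in zip(labels, weights):
    print(f"label={y:6.2f}  weight={w:.3f}")

# The weights could then be passed to the regressor, for example:
#   SVR().fit(X_train, y_train, sample_weight=weights)
```

The `fit` method of sklearn.svm.SVR accepts a `sample_weight` argument, so no extra package would be needed for this approach. Whether it actually helps in the [22, 55] range would have to be checked on held-out data.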