Hello!

For a project, I have to benchmark different algorithms that fill in missing values in time series.

I want to stress that this is imputation, not forecasting.

In my case, I have access to 15 years of complete temperature data from 20 stations.

I have several algorithms that, given the positions of the missing values, try to complete the data.

However, these algorithms have parameters that need to be set, so I want to use a classical method such as k-fold cross-validation.
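To make this concrete, here is a minimal sketch of the standard procedure I had in mind (Python; `impute` and its `params` are placeholders for one of my algorithms, not a real API): hide a random fold of observed entries, impute them, and score against the ground truth.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_imputation_score(data, impute, params, n_splits=5, seed=0):
    """data: complete (n_timesteps, n_stations) temperature array.
    `impute` is a hypothetical callable standing in for one of my algorithms."""
    idx = np.argwhere(np.isfinite(data))            # all observed positions
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for _, test in kf.split(idx):
        mask = np.zeros(data.shape, dtype=bool)
        mask[tuple(idx[test].T)] = True             # entries hidden in this fold
        masked = data.copy()
        masked[mask] = np.nan                       # hide the fold
        completed = impute(masked, **params)        # run the algorithm
        rmse = np.sqrt(np.mean((completed[mask] - data[mask]) ** 2))
        scores.append(rmse)
    return np.mean(scores)                          # average RMSE over folds
```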

Moreover, these algorithms have to be tested in a typical time series completion setting. Here is a sketch of what such a setting looks like:

http://prntscr.com/106qpc2

Each line is a time series. (Note that some series are, and must remain, complete for the benchmark.)

The red segments are the missing data to be completed, and the green segments are the unknown data. In practice, these algorithms exploit spatio-temporal structure, i.e. correlations across stations and across time, to complete the missing data.
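To give a rough idea of what I mean by spatio-temporal, here is a toy illustration (not one of my actual algorithms, and assuming all the other stations are fully observed): a gap at one station is filled by regressing that station on the others over the timesteps where it is observed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def impute_station(data, s):
    """Toy spatio-temporal fill-in: data is (n_timesteps, n_stations)
    with NaNs only in column s; the other stations are assumed complete."""
    miss = np.isnan(data[:, s])
    others = np.delete(data, s, axis=1)              # predictor stations
    model = LinearRegression().fit(others[~miss], data[~miss, s])
    out = data.copy()
    out[miss, s] = model.predict(others[miss])       # fill the gap
    return out
```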

However, I am facing a problem: k-fold cross-validation selects the held-out data purely at random, and it is exactly this randomness that bothers me. As the drawing shows, some algorithms may be tuned to perform well when the hidden data are scattered at random, whereas in reality they will only ever be evaluated on a template like the one in the image. In fact, I know in advance that some of my algorithms work very well when blocks of several years of known data are present, and not at all when the known data are selected purely at random.
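To illustrate the kind of layout I mean, here is a rough sketch of how such block-structured masks could be generated (the station counts and gap lengths of one to three years are assumptions for the example, not my real setup): a few stations get one long contiguous gap and the rest stay fully observed.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_mask(n_steps, n_stations, n_missing_stations=8,
               min_gap=365, max_gap=3 * 365):
    """Boolean mask, True where data should be hidden; mimics the
    template in the picture rather than random holes."""
    mask = np.zeros((n_steps, n_stations), dtype=bool)
    hit = rng.choice(n_stations, size=n_missing_stations, replace=False)
    for s in hit:
        gap = rng.integers(min_gap, max_gap + 1)     # gap length in days
        start = rng.integers(0, n_steps - gap + 1)   # where the gap starts
        mask[start:start + gap, s] = True            # one contiguous block
    return mask
```

Drawing several such masks with different seeds would give me validation splits that respect the block structure, but I am not sure whether tuning on those is a sound replacement for standard k-fold cross-validation.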

Do you have an idea of how to tune my algorithms' parameters so that they are optimal on a missing-data layout like the one in the image?

Thanks in advance

(Feel free to ask me questions if I haven't been clear enough)
