I'm currently working on a project that involves designing deep learning models for short-term forecasting of time series. The data is collected from telecom equipment and is non-stationary and highly stochastic. When I try to compare the performance of different models, such as LSTM, Transformer, and others, I run into a problem with the results.
In general, the time series is quite difficult to forecast, and when I look at MAE and MSE, the differences between models are very small. For example, the LSTM achieves an MSE of 0.282 +/- 0.14, while the Transformer achieves 0.273 +/- 0.12 (mean +/- standard deviation across runs). Even though the Transformer has the lower mean MSE, the standard deviation is so large relative to the difference that the result does not appear statistically significant. In this situation, how can I evaluate the performance of different models when the improvement is always much smaller than the standard deviation of the loss?
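To make the setup concrete, here is a minimal sketch of how I aggregate the results, assuming I have one test MSE per random seed for each model, with both models evaluated on the same splits. The per-seed arrays below are made-up placeholders, not my actual results, and I'm not sure whether a paired test over seeds like this is even the right way to check significance:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed test MSEs (placeholders, not my real results):
# one value per random seed, both models evaluated on the same test split.
lstm_mse = np.array([0.25, 0.41, 0.19, 0.33, 0.23])
transformer_mse = np.array([0.24, 0.39, 0.18, 0.34, 0.21])

# Mean +/- standard deviation, which is what I report above.
print(f"LSTM:        {lstm_mse.mean():.3f} +/- {lstm_mse.std(ddof=1):.3f}")
print(f"Transformer: {transformer_mse.mean():.3f} +/- {transformer_mse.std(ddof=1):.3f}")

# Paired comparison across seeds: since both models see the same data,
# the per-seed differences may be less noisy than the raw score spreads.
diff = lstm_mse - transformer_mse
t_stat, p_value = stats.ttest_rel(lstm_mse, transformer_mse)
print(f"mean difference: {diff.mean():.3f}, paired t-test p-value: {p_value:.3f}")
```

My thinking was that a paired test compares the per-seed differences rather than the two overlapping marginal distributions, but with only a handful of seeds I don't know if this is a sound protocol.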