But similar ideas are already being developed. Check the work of Michal Irani; she is working on networks that predict the rest of a video from a short fragment.
I think you need to define more precisely what image prediction is.
In [1], image prediction is narrowed down to next-frame prediction. In contrast, in other work I have seen, image prediction refers to image enhancement (e.g. super-resolution).
If you mean image prediction in the sense of [1], then you also need to restrict it to "real life" video. As explained in the reference, frames can be predicted because of the underlying continuity of the scene (i.e. objects follow the laws of physics, etc.). As also discussed there, the prediction interval is rather short, since additional multi-body dynamics may become too complex to predict. Video from movies, on the other hand, belongs to a different category, since the narrative can jump from scene to scene in a "chaotic" fashion.
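To make the continuity argument concrete, here is a minimal sketch in plain Python. It is only a toy under stated assumptions: the frames are synthetic, the predictor is a simple constant-velocity shift (not a deep network and not a method taken from [1]), and all names (`make_frame`, `predict_next`, `mse`) are hypothetical. It shows that extrapolating motion works when the next frame continues the scene and fails across a cut.

```python
# Toy illustration only: constant-velocity extrapolation on synthetic frames.
# All function names are hypothetical and are not taken from [1].

def make_frame(size, x):
    """Return a size x size frame with a bright 2x2 block whose left edge is at column x."""
    return [[1.0 if (x <= c < x + 2 and 3 <= r < 5) else 0.0
             for c in range(size)] for r in range(size)]

def centroid_col(frame):
    """Intensity-weighted mean column of the frame."""
    total = sum(sum(row) for row in frame)
    return sum(c * v for row in frame for c, v in enumerate(row)) / total

def predict_next(prev, last):
    """Shift the last frame by the horizontal motion observed between prev and last."""
    shift = round(centroid_col(last) - centroid_col(prev))
    size = len(last[0])
    return [[row[c - shift] if 0 <= c - shift < size else 0.0
             for c in range(size)] for row in last]

def mse(a, b):
    """Mean squared error between two frames of equal size."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n

SIZE = 8
prev, last = make_frame(SIZE, 1), make_frame(SIZE, 2)
prediction = predict_next(prev, last)

# Case 1: the scene continues, the block moves one more pixel to the right.
next_continuous = make_frame(SIZE, 3)
# Case 2: a scene cut, the next frame is unrelated (here: uniformly bright).
next_after_cut = [[1.0] * SIZE for _ in range(SIZE)]

error_continuous = mse(prediction, next_continuous)
error_cut = mse(prediction, next_after_cut)
print(error_continuous, error_cut)
```

The error for the continuous case is far smaller than for the cut, which is the sense in which movie footage with abrupt scene changes falls outside what next-frame prediction can exploit.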
Artur Gańcza,
Can you provide a direct reference to the work you are referring to? I cannot find any such reference in that researcher's Google Scholar profile.
References
[1] Deep Learning in Next-Frame Prediction: A Benchmark Review