Hi everyone,
I'm working on action prediction (i.e., early recognition) using Bag-of-Words (BoW). I know how to use BoW, but for action prediction there is a subtle problem.
In the literature, accuracy is always reported with respect to the observation ratio. So if a video is 50 frames long, I'm supposed to run the algorithm on only the first five frames (observation ratio = 0.1). But most of the time, descriptors only become available after 10 or 16 frames. For example, if the cuboid around a key point spans ten frames in time, no descriptor can be computed until ten frames have been observed.
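To make the arithmetic concrete, here is a rough sketch of the mismatch. The function names are just made up for illustration, and it rests on two assumptions of mine: the number of observed frames is round(ratio * video length), and a cuboid descriptor can only be computed once every frame it covers has been observed.

```python
def observed_frames(total_frames: int, ratio: float) -> int:
    """Frames visible at a given observation ratio (assumed: simple rounding)."""
    return round(total_frames * ratio)


def valid_cuboid_positions(n_observed: int, temporal_extent: int) -> int:
    """Number of temporal positions where a cuboid of `temporal_extent`
    frames fits entirely inside the observed part of the video."""
    return max(0, n_observed - temporal_extent + 1)


def min_usable_ratio(total_frames: int, temporal_extent: int) -> float:
    """Smallest observation ratio at which at least one cuboid fits."""
    return temporal_extent / total_frames


if __name__ == "__main__":
    total, extent = 50, 10  # the 50-frame video and ten-frame cuboid from above
    for ratio in (0.1, 0.2, 0.3, 0.5, 1.0):
        n = observed_frames(total, ratio)
        print(ratio, n, valid_cuboid_positions(n, extent))
    print("first usable ratio:", min_usable_ratio(total, extent))
```

Under these assumptions, an observation ratio of 0.1 gives five frames and zero valid cuboid positions, and the first descriptor only appears at ratio 0.2. That gap is exactly what I'm asking about.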
Yet many works have used such features for action prediction. In fact, this was the mainstream approach before deep learning was widely adopted.
Do you know how this is possible? None of the papers points this out.
Thanks a lot