Hi,
I have been working on online action recognition (estimating actions from a video stream). I have read some papers on this problem, and two approaches are commonly used:
1. Using a sliding window to estimate an action from a sequence of frames.
2. Estimating the action frame by frame.
With the first approach, misclassification mostly occurs during the onset and offset phases (when a person starts or stops performing an action).
With the second, the action has to be predicted from only a glimpse of information, since the decision should be made as soon as possible.
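To make the comparison concrete, here is a minimal Python sketch of how I understand the two setups. Everything in it is an assumption for illustration: each frame is reduced to a single number, `classify_frame` is a hypothetical per-frame classifier that just thresholds that number, and the sliding window is approximated by a majority vote over per-frame labels rather than a real clip-level model.

```python
from collections import Counter, deque
from typing import Iterable, Iterator, Optional

Frame = float  # assumption: one scalar feature stands in for a real video frame


def classify_frame(frame: Frame) -> str:
    """Hypothetical per-frame classifier: thresholds the scalar feature."""
    return "wave" if frame > 0.5 else "idle"


def frame_by_frame(stream: Iterable[Frame]) -> Iterator[str]:
    """Approach 2: emit a label for each frame as soon as it arrives."""
    for frame in stream:
        yield classify_frame(frame)


def sliding_window(stream: Iterable[Frame], window_size: int = 5) -> Iterator[Optional[str]]:
    """Approach 1: label each step by majority vote over the last window_size frames.

    Yields None until the window has filled up.
    """
    window = deque(maxlen=window_size)  # oldest label is dropped automatically
    for frame in stream:
        window.append(classify_frame(frame))
        if len(window) < window_size:
            yield None
        else:
            yield Counter(window).most_common(1)[0][0]


# Toy stream: 4 idle frames, 6 frames of the action, 4 idle frames.
stream = [0.1] * 4 + [0.9] * 6 + [0.1] * 4

per_frame = list(frame_by_frame(stream))
windowed = list(sliding_window(stream, window_size=5))

print(per_frame.index("wave"))  # 4: switches at the true onset
print(windowed.index("wave"))   # 6: switches two frames after the onset
```

On this toy stream the action covers frames 4 to 9 (0-indexed). The frame-by-frame output switches immediately, while the windowed output switches two frames late at both the onset and the offset, which is the boundary error described above. The toy per-frame classifier is perfect here only because the feature is noise-free; with real frames it would have much less evidence per decision.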
What do you think about these approaches?
Any suggestions would be very helpful.