There are many methods for this task, and it is hard to single out one best method. However, I think motion tracking by optical flow is a very good approach!
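As a rough illustration of the optical-flow idea, here is a minimal, self-contained sketch of the classic Lucas-Kanade least-squares step in pure numpy, estimating the motion of a single patch between two frames. Real trackers (e.g., in OpenCV) add image pyramids, iteration, and windowing; this is only the core equation, and the synthetic test frames are my own example, not from any paper.

```python
import numpy as np

def lucas_kanade_patch(prev, curr):
    """Estimate (dx, dy) motion of a patch between two grayscale
    frames using the Lucas-Kanade least-squares step.
    A simplified sketch; real trackers add pyramids and iteration."""
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    # Spatial gradients (central differences) and temporal gradient.
    Ix = (np.roll(prev, -1, axis=1) - np.roll(prev, 1, axis=1)) / 2.0
    Iy = (np.roll(prev, -1, axis=0) - np.roll(prev, 1, axis=0)) / 2.0
    It = curr - prev
    # Solve the 2x2 normal equations  A [dx, dy]^T = b.
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    dx, dy = np.linalg.solve(A, b)
    return dx, dy

# Synthetic example: a bright blob shifted one pixel to the right.
frame1 = np.zeros((32, 32))
frame1[10:20, 10:20] = 255.0
frame2 = np.roll(frame1, 1, axis=1)   # shift right by 1 px
dx, dy = lucas_kanade_patch(frame1, frame2)
print(round(dx, 2), round(dy, 2))     # expect dx ~ 1, dy ~ 0
```

For real videos you would run this per feature point over a small window around each point rather than over the whole frame.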
If you want to track an object (e.g., a human) in a video, first remove noise from the video frames, then segment the frames using frame-difference and binary-conversion techniques, and finally track the object with a bounding box placed where high-intensity values occur horizontally and vertically.
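The frame-difference/binary-conversion/bounding-box pipeline described above can be sketched in a few lines of numpy. The threshold value and the synthetic frames below are assumptions for illustration; in practice you would tune the threshold per video and add noise filtering first.

```python
import numpy as np

def track_by_frame_difference(prev, curr, thresh=30):
    """Segment motion by frame differencing, binarize, and fit a
    bounding box where the row/column projections of the binary
    mask are nonzero. 'thresh' is an assumed value to tune."""
    diff = np.abs(curr.astype(np.int32) - prev.astype(np.int32))
    mask = diff > thresh                      # binary conversion
    rows = np.where(mask.any(axis=1))[0]      # rows containing motion
    cols = np.where(mask.any(axis=0))[0]      # columns containing motion
    if rows.size == 0:
        return None                           # no motion detected
    # (top, left, bottom, right) bounding box
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

# Synthetic example: an object appears in the second frame.
prev = np.zeros((40, 40), dtype=np.uint8)
curr = prev.copy()
curr[5:15, 20:30] = 200
print(track_by_frame_difference(prev, curr))  # (5, 20, 14, 29)
```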
Hi, take a look at these papers; the proposed methods are all easy to understand and implement, and quite helpful. Think about what you need and what best fits your purpose; based on that, you can select the best one and use it, or even improve on it.
You can extract motion vectors by segmentation. The method should be background-invariant so that it extracts only foreground motion; for that, use background models.
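One simple background model is a running average: the background adapts slowly, so fast-changing foreground pixels stand out when you subtract it. The learning rate and threshold below are assumed values for illustration; production systems typically use per-pixel statistical models (e.g., mixtures of Gaussians) instead.

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: blend the new frame into
    the background with a small learning rate (alpha is assumed)."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=25):
    """Pixels that differ strongly from the background model are
    treated as foreground motion."""
    return np.abs(frame - bg) > thresh

# Static scene with one bright foreground pixel.
bg = np.full((10, 10), 50.0)
frame = bg.copy()
frame[4, 4] = 250.0                      # foreground object
mask = foreground_mask(bg, frame)
print(int(mask.sum()), bool(mask[4, 4]))  # 1 True
bg = update_background(bg, frame)         # background slowly adapts
```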
The first thing I would do is try to reproduce the authors' results. You can usually reach out to the author(s) and ask for the code used in the paper. This is a good starting point, since they may share their analysis framework. However, you may need to set up a more robust framework that lets you compare results between your methods and the one the authors used.
After implementing their method, there are some exercises to consider doing.
First, which videos failed classification? If you move these videos out of the dataset into a separate set, rejected_data, you should get 100% accuracy on the remaining data.
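Splitting the failures out is straightforward once you have per-video predictions. The video names and labels below are hypothetical placeholders, not from the paper:

```python
def split_by_correctness(samples, predictions, labels):
    """Split a dataset into correctly classified items and a
    'rejected_data' set of failures. By construction, the kept
    set is then classified with 100% accuracy."""
    kept, rejected_data = [], []
    for sample, pred, label in zip(samples, predictions, labels):
        (kept if pred == label else rejected_data).append(sample)
    return kept, rejected_data

# Hypothetical per-video predictions vs. ground truth.
videos = ["v1", "v2", "v3", "v4"]
preds  = ["walk", "run", "walk", "jump"]
truth  = ["walk", "walk", "walk", "jump"]
kept, rejected = split_by_correctness(videos, preds, truth)
print(kept, rejected)  # ['v1', 'v3', 'v4'] ['v2']
```

The interesting analysis then happens on rejected_data: what do its videos have in common?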
What qualitative observations can you make that characterize the quantitative performance differences in their approach? Some observations you can make (and turn into new features): day/night, male/female, number of people, number of actions, fast/slow, cluttered/sparse, verbal/non-verbal. You could add these extra 'labels' to each video's descriptor file and then use machine-learning tools to explore the correlation between your new features and accuracy.
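A simple way to start exploring such correlations is to group results by one hand-added label and compare accuracy per group. The field names and numbers below are purely illustrative:

```python
from collections import defaultdict

def accuracy_by_feature(records, feature="lighting"):
    """Group per-video results by a hand-added feature label
    (e.g. 'day'/'night') and report accuracy per group, to see
    which conditions correlate with failures."""
    totals = defaultdict(lambda: [0, 0])   # value -> [correct, count]
    for rec in records:
        stats = totals[rec[feature]]
        stats[0] += rec["correct"]
        stats[1] += 1
    return {k: c / n for k, (c, n) in totals.items()}

# Hypothetical per-video annotations and outcomes.
results = [
    {"lighting": "day",   "correct": 1},
    {"lighting": "day",   "correct": 1},
    {"lighting": "night", "correct": 0},
    {"lighting": "night", "correct": 1},
]
print(accuracy_by_feature(results))  # {'day': 1.0, 'night': 0.5}
```

A large accuracy gap between groups suggests that feature is worth investigating further (or feeding into a proper statistical test).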
Here are some examples of questions I would ask to better understand the reasons for performance differences in their results:
- Is background illumination a factor in performance?
- Does the number of people in the scene affect accuracy?
- How do non-moving objects affect performance?
- Does gender/skin tone/clothing/hair/etc. affect performance?
etc...
I can provide more motivating questions if necessary, but this should be a good starting point. Also, it may be a bit ambitious to try to define a general-purpose recognition strategy; many great classification systems are composed not of a single classifier model but of an ensemble.
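To make the ensemble point concrete, here is a minimal majority-vote combiner. The three toy "classifiers" are hypothetical stand-ins for real trained models:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine several classifiers by majority vote -- the simplest
    form of the ensemble idea mentioned above."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Toy 'classifiers' (stand-ins for real models).
clf_a = lambda x: "walk" if x < 5 else "run"
clf_b = lambda x: "walk" if x < 7 else "run"
clf_c = lambda x: "run"
print(majority_vote([clf_a, clf_b, clf_c], 3))  # walk
```

Real ensembles go further (weighted voting, stacking, boosting), but even plain voting over diverse models often beats any single one.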