Bag-of-Words (BoW) is probably the most popular feature-representation technique for videos and still images in the domain of human action recognition. It is used in combination with the other steps of the feature-extraction pipeline.
The whole recipe includes: (a) selecting space-time interest points (e.g. Harris corners, dense trajectories) plus spatial or temporal segmentation; (b) extracting low-level descriptors around those points (e.g. HOG, HOF, motion boundary histograms, motion interchange patterns, SIFT); (c) sampling a subset of them, say 100,000; (d) clustering the samples (e.g. with k-means) to obtain the visual words, say 2,000 of them; (e) vector-quantizing your data against that vocabulary. On top of this, choose the distance/similarity measure, kernel, classifier, normalization, etc. accordingly.
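A minimal sketch of steps (c)–(e) in Python, assuming the descriptors from step (b) are already available as a NumPy array (random data stands in for real HOG/HOF descriptors here):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for real local descriptors (e.g. HOG/HOF around interest
# points) -- in practice these come from step (b) of the recipe.
descriptors = rng.normal(size=(5000, 64)).astype(np.float32)

# (c) sample a subset of descriptors for vocabulary training
sample = descriptors[rng.choice(len(descriptors), 1000, replace=False)]

# (d) cluster the sample into K visual words
K = 50
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(sample)

# (e) vector-quantize one video's descriptors and build its BoW histogram
video_desc = rng.normal(size=(300, 64)).astype(np.float32)
words = codebook.predict(video_desc)
bow, _ = np.histogram(words, bins=np.arange(K + 1))
bow = bow / bow.sum()  # L1-normalize so clip length does not matter
```

The resulting `bow` vector is what gets fed to the classifier/kernel stage; the sizes (1,000 samples, K=50) are toy values, not the 100,000/2,000 the text suggests for real data.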
In any case, BoW is known to provide a good data-driven representation in this domain.
I used BoW some years ago to try to discriminate human poses (actions).
Here is what I did:
1. learn a codebook (set of visual words)
2. for a new image, compute keypoints (e.g., SURF keypoints)
3. map each keypoint descriptor to a word
4. compute the histogram of words = Bag of Words = how often does each word x appear?
5. compare the current BoW with prototype BoWs (one BoW footprint per pose) and determine the most similar one. The pose attached to it is the final pose estimate.
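Step 5 can be sketched as nearest-prototype matching on the histograms. The pose names, the chi-square distance, and the toy histograms below are all my own illustrative choices, not taken from the original setup:

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    # symmetric chi-square distance between two L1-normalized histograms
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

rng = np.random.default_rng(1)
K = 50  # codebook size

# Hypothetical prototype BoW footprints, one per pose class
prototypes = {name: rng.dirichlet(np.ones(K))
              for name in ("stand", "sit", "wave")}

# BoW of the current image (here: a slightly perturbed copy of "sit")
query = prototypes["sit"] + 0.01 * rng.random(K)
query /= query.sum()

# pick the pose whose prototype histogram is closest
best = min(prototypes, key=lambda name: chi2_distance(query, prototypes[name]))
print(best)  # -> "sit" for this toy data
```

Other histogram comparisons (histogram intersection, cosine similarity) work here too; chi-square is just a common choice for BoW histograms.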
This seems to work for discriminating a small number of poses/actions.
Note that I did not use any sequence information here, i.e., no STIPs (Space-Time Interest Points) as proposed by Muhammad Shahzad Cheema.