I would like to reduce sequences of about 50 RGB frames of 640x480 pixels to a representation I can feed into a deep neural network. The goal is to recognize a performed activity/gesture.
I have seen many examples for individual images with static gestures, but I struggle to find practical examples that use whole sequences with a dynamic gesture.
I have worked through the tutorial here*, which uses the MNIST dataset to train the network, so its inputs are 28x28 pixel images. I would like to use my own data as input, but I don't really know how to reduce it, or how much reduction is enough/necessary.
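Just to make the scale of the problem explicit: a single sequence is 50 x 640 x 480 x 3 ≈ 46 million values, compared to MNIST's 784. The only reduction I can think of myself is naive grayscale conversion plus downscaling, along these lines (a sketch only, assuming the frames are available as NumPy arrays; the 32x32 target size is just a placeholder I picked, not something from the tutorial):

    import cv2
    import numpy as np

    TARGET_SIZE = (32, 32)  # placeholder resolution, roughly MNIST-sized

    def downscale_sequence(frames, size=TARGET_SIZE):
        """Grayscale + downscale each frame, then flatten the whole sequence."""
        small = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), size,
                            interpolation=cv2.INTER_AREA)
                 for f in frames]
        return np.stack(small).astype(np.float32).ravel() / 255.0

    # 50 frames x 32 x 32 = 51,200 inputs per sequence -- still about 65x
    # larger than an MNIST image, and I don't know whether that is small enough.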
What I have done so far is remove the background and then run edge detection using the OpenCV Canny edge detector**, which works fine but still leaves me with a lot of data.
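Roughly, my current per-frame preprocessing looks like this (a sketch only: I am assuming OpenCV's MOG2 background subtractor here, and the Canny thresholds 100/200 are example values, not the ones I actually tuned):

    import cv2

    subtractor = cv2.createBackgroundSubtractorMOG2()  # assumed background model

    def preprocess(frame):
        """Remove the background, then run Canny edge detection on the rest."""
        mask = subtractor.apply(frame)                      # foreground mask (0/255)
        foreground = cv2.bitwise_and(frame, frame, mask=mask)
        gray = cv2.cvtColor(foreground, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)                   # still a full 640x480 image
        return edges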
I also tried using optical flow to generate something like a heatmap, but I am not very happy with the results. I have read about DTW (dynamic time warping) and space-time shapes, but have not yet found a way to apply the theory in practice.
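In case it helps, the heatmap attempt was along these lines (again only a sketch: I am assuming Farnebäck dense optical flow via OpenCV's calcOpticalFlowFarneback, and summing the flow magnitude over the sequence was my own ad-hoc idea, not taken from a paper):

    import cv2
    import numpy as np

    def motion_heatmap(frames):
        """Accumulate dense optical-flow magnitude over a sequence
        into a single 'motion energy' image."""
        prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
        heat = np.zeros(prev.shape, dtype=np.float32)
        for f in frames[1:]:
            curr = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            heat += mag
            prev = curr
        return heat / max(float(heat.max()), 1e-6)  # normalize to [0, 1]

The obvious downside is that all temporal ordering is collapsed into a single image, which is probably why the results are disappointing.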
So, do you have any hints, tips, or links to papers, tutorials, presentations, or anything else that could help me reduce the video sequences without losing too much data? I would prefer practical examples.
Thank you!
* http://www.deeplearning.net/tutorial/DBN.html#dbn
** http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_canny/py_canny.html#canny