16 January 2014

I am thinking about a project in which I use data from a Kinect sensor to recognize gestures. I will get a training set of 14,000 labeled scenes drawn from a vocabulary of 20 gestures. I intend to feed the multimodal inputs (RGB, RGB-D and skeleton) into a Deep Belief Network.

The idea is to implement three phases: unsupervised feature learning with Restricted Boltzmann Machines, whose learned weights then initialize a supervised feedforward neural network that is fine-tuned with backpropagation, followed by testing. A rough sketch of what I have in mind is below.
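To make the plan a bit more concrete, here is a very rough NumPy sketch of phase 1 (greedy layer-wise RBM pretraining with CD-1). The layer sizes, learning rate and the random stand-in data are placeholders I made up, not values from the paper or from my data set, and phases 2 and 3 are only indicated in comments:

```python
# Sketch of phase 1: greedy layer-wise RBM pretraining with CD-1.
# All sizes and hyperparameters are placeholders, not real project values.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.01, batch=100):
    """Train one binary RBM with one step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            h0 = sigmoid(v0 @ W + b_h)                      # positive phase
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            v1 = sigmoid(h0_sample @ W.T + b_v)             # reconstruction
            h1 = sigmoid(v1 @ W + b_h)                      # negative phase
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            b_v += lr * (v0 - v1).mean(axis=0)
            b_h += lr * (h0 - h1).mean(axis=0)
    return W, b_h

# Stand-in for preprocessed, concatenated multimodal feature vectors in [0, 1].
X = rng.random((500, 2000))

# Phase 1: stack RBMs greedily; each layer trains on the previous layer's activations.
weights = []
activations = X
for n_hidden in (1000, 500):
    W, b_h = train_rbm(activations, n_hidden)
    weights.append((W, b_h))
    activations = sigmoid(activations @ W + b_h)

# Phase 2 (not shown): use `weights` to initialize a feedforward network with a
# 20-way softmax output and fine-tune the whole stack with backpropagation.
# Phase 3 (not shown): evaluate on a held-out test split of the labeled scenes.
```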

I am not yet sure about the preprocessing of the data, but I guess I will have to do a lot of work to compress the input. Any ideas here are highly welcome, too; a sketch of what I am considering follows.
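For the preprocessing I am currently imagining something along these lines. The 64x64 downsampling, the 4000 mm depth clipping and the hip-centering of 20 skeleton joints are all my own assumptions, not something I have validated against the Kinect data yet:

```python
# Hedged preprocessing sketch: compress one depth frame and one skeleton pose
# into a single input vector. Resolutions and constants are assumptions.
import numpy as np

def preprocess_depth(depth_frame, out_size=(64, 64), max_depth_mm=4000.0):
    """Clip, normalize and block-average one Kinect depth frame to out_size."""
    d = np.clip(depth_frame.astype(np.float32), 0, max_depth_mm) / max_depth_mm
    h, w = d.shape
    fy, fx = h // out_size[0], w // out_size[1]
    d = d[:fy * out_size[0], :fx * out_size[1]]
    d = d.reshape(out_size[0], fy, out_size[1], fx).mean(axis=(1, 3))
    return d.ravel()                                   # e.g. 64*64 = 4096 values

def preprocess_skeleton(joints_xyz):
    """Center skeleton joints on the first joint and scale to unit size."""
    centered = joints_xyz - joints_xyz[0]              # joint 0 assumed to be the hip center
    scale = np.linalg.norm(centered, axis=1).max()
    return (centered / (scale if scale > 0 else 1.0)).ravel()

# Example: one 480x640 depth frame plus 20 joints becomes one input vector.
depth = np.random.randint(0, 4000, size=(480, 640))
joints = np.random.rand(20, 3)
x = np.concatenate([preprocess_depth(depth), preprocess_skeleton(joints)])
print(x.shape)                                         # (4156,)
```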

Does this sound like it makes sense at all? Will I need a supercomputer, or will a regular workstation do? How can I estimate the computing time? Are there any rules of thumb? My own back-of-envelope attempt is below.
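This is how I have tried to estimate the training cost so far; every number in it is an assumption and the result can only be an order of magnitude:

```python
# Back-of-envelope training cost estimate. All numbers are assumptions.
n_examples  = 14_000
layer_sizes = [4156, 1000, 500, 20]     # input (depth + skeleton), two hidden layers, 20 classes
n_weights   = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
epochs      = 50
# ~6 multiply-adds per weight per example to cover forward pass, backward pass
# and the weight update, which is a very rough rule of thumb.
total_flops = 6 * n_weights * n_examples * epochs
sustained_flops = 5e9                   # assumed sustained throughput of one 2014-era CPU core
print(f"~{total_flops / sustained_flops / 3600:.1f} hours of pure number crunching")
```

With these assumptions it comes out on the order of an hour per full training run on a single core, which suggests a regular workstation rather than a supercomputer, but I would like to hear whether this way of estimating it is reasonable.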

The idea is inspired mainly by the paper "Multimodal Deep Learning" by Jiquan Ngiam, Aditya Khosla, Andrew Y. Ng and others. I am pretty new to all of this and would love to hear opinions or advice; anything is welcome. If you need more input, I am happy to provide it where possible.

Here is another question I have about the project:

https://www.researchgate.net/post/Deep_learning_implementation_in_Java
