I was wondering if anyone knows of, or has published, a technique that successfully combines shallow (HOG, SIFT, LBP) and deep (GoogLeNet) representations? I am interested in both the image and the video case.
One easy way to combine them is to extract the deep features and the other (hand-crafted) features, concatenate them into a single feature vector, and finally normalize it to the range 0 to 1.
However, I think these low-level features will not bring any improvement over what the deep features achieve alone.
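The concatenate-then-normalize idea described above can be sketched as follows. This is only a minimal illustration: the function names and the numeric values are hypothetical placeholders, not the output of a real GoogLeNet or HOG extractor, and it assumes that "normalize between 0 and 1" means per-dimension min-max scaling over the available samples.

```python
def concatenate_features(deep, shallow):
    """Join the deep and shallow descriptors of each sample into one vector."""
    if len(deep) != len(shallow):
        raise ValueError("need one deep and one shallow descriptor per sample")
    return [list(d) + list(s) for d, s in zip(deep, shallow)]


def min_max_scale(rows):
    """Rescale every feature dimension (column) to [0, 1] across the samples."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Placeholder descriptors for three samples (assumed values, illustration only).
deep_features = [[0.2, 1.5, -0.3], [0.9, 0.1, 0.4], [0.5, 0.8, 0.0]]
shallow_features = [[12.0, 3.0], [7.0, 9.0], [10.0, 6.0]]

fused = min_max_scale(concatenate_features(deep_features, shallow_features))
```

Scaling each dimension separately keeps the hand-crafted features, which often have a much larger numeric range, from dominating the deep ones in a distance-based classifier. In practice the minima and maxima would be computed on the training set only and reused at test time.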
Fischer et al. showed that CNN features outperform local descriptors based on orientation histograms, such as SIFT, HOG, and SURF, in descriptor matching. Follow:
Fischer et al.,"Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT", 2014 - https://www.researchgate.net/profile/Alexey_Dosovitskiy/publication/262568634_Descriptor_Matching_with_Convolutional_Neural_Networks_a_Comparison_to_SIFT/links/541fe8dc0cf2218008d41617.pdf
This was confirmed by the state-of-the-art GoogLeNet. Follow:
Szegedy et al., "Going Deeper with Convolutions", 2014 - https://arxiv.org/abs/1409.4842
Nevertheless, Benenson et al. highlighted "...although some of these features might be driven by learning, they are mainly hand-crafted via trial and error,..." and concluded about deep architectures: "...Despite the common narrative there is still no clear evidence that deep networks are good at learning features for pedestrian detection (when using pedestrian detection training data). Most successful methods use such architectures to model higher level aspects of parts, occlusions, and context. The obtained results are on par with DPM and decision forest approaches, making the advantage of using such involved architectures yet unclear...".
Follow:
Benenson et al., "Ten Years of Pedestrian Detection, What Have We Learned?", 2014 - https://www.researchgate.net/profile/Rodrigo_Benenson/publication/268452043_Ten_Years_of_Pedestrian_Detection_What_Have_We_Learned/links/54f4897a0cf2eed5d734d3ee.pdf
As a compromise, combining deep learning with local descriptors can improve computational performance. Follow:
Lipetski, et al., "A combined HOG and deep convolution network cascade for pedestrian detection", 2017 - http://www.ingentaconnect.com/contentone/ist/ei/2017/00002017/00000004/art00003?crawler=true&mimetype=application/pdf
Lipetski, et al., "Close to real-time robust pedestrian detection and tracking", 2015 - http://booksc.org/book/51048339/446fa6
Alom et al., "Robust Multi-view Pedestrian Tracking Using Neural Networks", 2017 - https://arxiv.org/abs/1704.06370
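To illustrate why such a combination can save computation, here is a rough sketch of the general cascade idea: a cheap detector (for example HOG-based) discards most candidate windows, and only the survivors are passed to the expensive CNN. This is not the pipeline from the papers above; the function name, the thresholds, and the stand-in scoring functions are all hypothetical.

```python
def cascade_detect(windows, cheap_score, expensive_score,
                   cheap_threshold, final_threshold):
    """Two-stage cascade: the expensive scorer only sees cheap-stage survivors."""
    # Stage 1: fast rejection with the cheap (e.g. HOG-based) score.
    candidates = [w for w in windows if cheap_score(w) >= cheap_threshold]
    # Stage 2: verify the remaining windows with the costly (e.g. CNN) score.
    detections = [w for w in candidates if expensive_score(w) >= final_threshold]
    return detections, len(candidates)


# Stand-in windows with precomputed scores (assumed values, illustration only).
windows = [
    {"id": 0, "hog": 0.10, "cnn": 0.90},
    {"id": 1, "hog": 0.70, "cnn": 0.20},
    {"id": 2, "hog": 0.80, "cnn": 0.95},
    {"id": 3, "hog": 0.05, "cnn": 0.10},
]

cnn_calls = []  # records which windows reach the expensive stage


def hog_score(window):
    return window["hog"]


def cnn_score(window):
    cnn_calls.append(window["id"])
    return window["cnn"]


detections, n_candidates = cascade_detect(
    windows, hog_score, cnn_score, cheap_threshold=0.5, final_threshold=0.5
)
```

The trade-off is that any true positive rejected by the cheap stage is lost for good, so the first threshold is usually set low enough to favor recall.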
Finally, Milan et al. very recently proposed a recurrent neural network to address online multi-target tracking. They cast "...the classical Bayesian state estimation, data association as well as track initiation and termination tasks as a recurrent neural net, allowing for full end-to-end learning of the model...".
Follow:
Milan et al., "Online Multi-Target Tracking Using Recurrent Neural Networks", 2017 - https://www.researchgate.net/profile/Anton_Milan/publication/301876852_Online_Multi-target_Tracking_using_Recurrent_Neural_Networks/links/573b367f08ae9f741b2d7ba2/Online-Multi-target-Tracking-using-Recurrent-Neural-Networks.pdf
Regards