The algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from them. For this, the Haar features shown in the image below are used. They work much like a convolutional kernel. Each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle...
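As a rough illustration of that subtraction, here is a minimal sketch of a two-rectangle (edge) feature. It assumes a grayscale image stored as a NumPy array, with the white rectangle on the left half of the region and the black rectangle on the right half. The function name `two_rect_feature` and the sample `patch` are made up for this example and are not part of any library.

```python
import numpy as np

def two_rect_feature(img, top, left, height, width):
    """Value of a two-rectangle Haar-like feature (hypothetical helper).

    Assumed layout: white rectangle = left half, black rectangle = right half.
    Returns (sum under black) - (sum under white).
    """
    img = np.asarray(img, dtype=np.int64)  # avoid overflow on uint8 input
    half = width // 2
    white = img[top:top + height, left:left + half].sum()
    black = img[top:top + height, left + half:left + 2 * half].sum()
    return int(black - white)

# A tiny 2x4 patch: dark on the left, bright on the right (a vertical edge).
patch = np.array([[10, 10, 200, 200],
                  [10, 10, 200, 200]], dtype=np.uint8)

edge_value = two_rect_feature(patch, top=0, left=0, height=2, width=4)
flat_value = two_rect_feature(np.full((2, 4), 50), top=0, left=0, height=2, width=4)
print(edge_value, flat_value)  # 760 0
```

A strong edge gives a large magnitude, while a uniform region gives zero, which is why such a feature responds to local contrast patterns.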
Their final setup had around 6000 features. (Imagine a reduction from 160000+ features to 6000 features. That is a big gain.)
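To get a feel for where a number like 160000+ can come from, the sketch below counts every position and scale at which a Haar-like feature fits inside a detection window. It assumes a 24x24 window and five basic feature shapes (two-rectangle horizontal and vertical, three-rectangle horizontal and vertical, and four-rectangle); neither assumption is stated in the text above, and `count_haar_features` is a hypothetical helper.

```python
def count_haar_features(window=24,
                        shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count all placements of the given base shapes in a square window.

    Each shape is (base_width, base_height) in rectangle units; a feature
    can be scaled to any multiple of its base size that still fits.
    """
    total = 0
    for base_w, base_h in shapes:
        for w in range(base_w, window + 1, base_w):        # scaled widths
            for h in range(base_h, window + 1, base_h):    # scaled heights
                # Number of top-left corners where a w x h feature fits.
                total += (window - w + 1) * (window - h + 1)
    return total

feature_count = count_haar_features()
print(feature_count)  # 162336
```

Under these assumptions the count is 162336, so keeping roughly 6000 means using under 4% of the candidates.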
For this, they introduced the concept of a Cascade of Classifiers. Instead of applying all 6000 features to a window, the features are grouped into different stages of classifiers and applied one by one. (Normally the first few stages contain very few features.)
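The staged evaluation can be sketched as follows. This is a simplified illustration, not the trained detector: each stage here is just a weighted sum of a few precomputed feature values compared against a threshold, and a window is rejected as soon as any stage fails. The names `make_stage` and `run_cascade`, along with all weights and thresholds, are invented for the example.

```python
def make_stage(feature_indices, weights, threshold):
    """Build one stage (hypothetical): passes if the weighted sum of the
    selected feature values reaches the threshold."""
    def stage(features):
        score = sum(w * features[i] for i, w in zip(feature_indices, weights))
        return score >= threshold
    return stage

def run_cascade(stages, features):
    """Apply stages in order; stop at the first failure.

    Returns (accepted, number_of_stages_evaluated).
    """
    for evaluated, stage in enumerate(stages, start=1):
        if not stage(features):
            return False, evaluated   # rejected early, later stages skipped
    return True, len(stages)

# Early stages use very few features; later stages use more.
stages = [
    make_stage([0], [1.0], 0.5),
    make_stage([1, 2], [1.0, 1.0], 1.0),
    make_stage([0, 1, 2, 3], [1.0, 1.0, 1.0, 1.0], 2.5),
]

face_like = [1.0, 0.8, 0.7, 0.9]     # made-up feature values
background = [0.1, 0.9, 0.9, 0.9]

face_result = run_cascade(stages, face_like)
background_result = run_cascade(stages, background)
print(face_result, background_result)  # (True, 3) (False, 1)
```

The background window is discarded after a single one-feature stage, so most of the work is spent only on windows that look promising.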