Huge topic, I think. Take a look at this paper for a specific stereo matching cost function implementation and a brief survey of the state of the art. If you want sparse rather than dense matching for depth extraction from images, check point operators.
I haven't worked on this specific task, but I'd suggest using horizontal block cross-correlation to find the best displacement of each block's central pixel. It's somewhat computationally expensive, but it should work well.
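To make the idea concrete, here is a minimal sketch of horizontal block matching with normalized cross-correlation. It assumes a rectified grayscale pair stored as 2D NumPy arrays, so that a pixel at column `col` in the left image shows up at column `col - d` in the right image. The names `ncc`, `best_disparity`, `half`, `d_min` and `d_max` are made up for illustration, and the block size and search range are arbitrary defaults, not recommended values.

```python
import numpy as np


def ncc(a, b):
    """Normalized cross-correlation of two equally sized blocks."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A flat block has zero variance; treat it as uncorrelated.
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def best_disparity(left, right, row, col, half=3, d_min=0, d_max=16):
    """Search along the same row of `right` for the block that best matches
    the block centred at (row, col) in `left`.

    Returns (disparity, score). Assumes the block around (row, col) lies
    fully inside the image (hypothetical helper, no bounds handling on rows).
    """
    ref = left[row - half:row + half + 1, col - half:col + half + 1].astype(np.float64)
    best_d, best_score = d_min, -np.inf
    for d in range(d_min, d_max + 1):
        c = col - d
        if c - half < 0:
            break  # candidate block would leave the image on the left side
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(np.float64)
        score = ncc(ref, cand)
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score
```

Calling `best_disparity(left, right, row, col)` for every pixel gives a dense displacement map; the double loop over pixels and displacements is where the cost comes from.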
Also, if the min-max distance range is known, the displacement only needs to be computed over a limited subset of possible displacements (geometric considerations can help define the min-max displacement in relation to the camera baseline).
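As a rough illustration of that geometric bound: for a rectified pair with parallel optical axes (an assumption, not something stated above), the displacement in pixels is d = f * B / Z, where f is the focal length in pixels, B the baseline and Z the distance. A known distance range then maps directly to a displacement range. The function name and parameters below are hypothetical.

```python
import math


def disparity_range(focal_px, baseline_m, z_min_m, z_max_m):
    """Integer displacement bounds (in pixels) for distances in [z_min_m, z_max_m].

    Assumes an ideal rectified pinhole stereo pair, where d = f * B / Z.
    """
    if not (0 < z_min_m <= z_max_m):
        raise ValueError("expected 0 < z_min_m <= z_max_m")
    d_min = focal_px * baseline_m / z_max_m  # farthest point, smallest shift
    d_max = focal_px * baseline_m / z_min_m  # nearest point, largest shift
    # Round outwards so the true displacement is never excluded.
    return math.floor(d_min), math.ceil(d_max)


# Example: f = 800 px, B = 0.5 m, scene between 2 m and 20 m.
print(disparity_range(800.0, 0.5, 2.0, 20.0))  # (20, 200)
```

Restricting the search to these bounds cuts the number of candidate blocks per pixel and also rules out some false matches.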
These days, stereo vision has been greatly improved by feature extraction methods. Keypoints are the best and most reliable features, and they are used in many fields.
I do not recommend SIFT, as it has been shown to be no better than plain neural networks. SURF is more reliable, but it is a general-purpose feature detection algorithm; you are better off using an algorithm designed specifically for stereo matching.