I am trying to integrate deep web query interfaces. To apply machine learning techniques, I need to create a training set so that I can build a model and then use that model to integrate new data sources. However, I cannot work out which features to collect from these query interfaces (web pages) in order to classify them. For example, in the book domain, I want to create a single common interface from several book-domain web pages. To do that, I need to find the correspondences among the elements of the individual pages.
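
To make the problem concrete, here is a rough sketch of one possible starting point, not an established method. It treats each form control's label text, `name` attribute, and input type as candidate features, and pairs fields across two pages by token overlap (Jaccard similarity). The two sample forms, the helper names (`FormFieldExtractor`, `extract_fields`, `match_fields`), and the 0.3 threshold are all hypothetical and chosen only for illustration. It also assumes each label appears just before its control, which real pages often violate.

```python
import re
from html.parser import HTMLParser


class FormFieldExtractor(HTMLParser):
    """Collect label text, name and type for each form control (illustrative)."""

    def __init__(self):
        super().__init__()
        self.fields = []        # one dict per form control
        self._in_label = False
        self._label_parts = []  # text fragments of the label being read
        self._last_label = ""   # most recent completed label text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._in_label = True
            self._label_parts = []
        elif tag in ("input", "select", "textarea"):
            # Assumption: the preceding <label> describes this control.
            field_type = attrs.get("type", "text") if tag == "input" else tag
            self.fields.append({
                "label": self._last_label,
                "name": attrs.get("name") or "",
                "type": field_type,
            })
            self._last_label = ""

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False
            self._last_label = " ".join("".join(self._label_parts).split())

    def handle_data(self, data):
        if self._in_label:
            self._label_parts.append(data)


def extract_fields(html):
    """Return the list of form-control feature dicts found in an HTML string."""
    parser = FormFieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


def tokens(field):
    """Lowercase word tokens drawn from a field's label and name attribute."""
    return set(re.findall(r"[a-z]+", f"{field['label']} {field['name']}".lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_fields(fields_a, fields_b, threshold=0.3):
    """Pair each field in A with its most similar field in B, if similar enough."""
    matches = []
    for fa in fields_a:
        best, best_score = None, 0.0
        for fb in fields_b:
            score = jaccard(tokens(fa), tokens(fb))
            if score > best_score:
                best, best_score = fb, score
        if best is not None and best_score >= threshold:
            matches.append((fa["name"], best["name"], round(best_score, 2)))
    return matches


# Two made-up book-search forms standing in for real query interfaces.
SITE_A = """<form>
  <label>Book title</label><input type="text" name="title">
  <label>Author</label><input type="text" name="author">
</form>"""
SITE_B = """<form>
  <label>Author name</label><input name="author_name">
  <label>Title</label><input name="book_title">
</form>"""

if __name__ == "__main__":
    for pair in match_fields(extract_fields(SITE_A), extract_fields(SITE_B)):
        print(pair)
```

The same per-field features (label tokens, name tokens, control type) could serve as input to a learned classifier or matcher in place of the fixed threshold. Which features actually work for real deep web interfaces is exactly what I am asking about.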