The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.
Note
Feature extraction is very different from Feature selection: the former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features.
This looks like a hard task, because I don't know of any way to formally define your "textual aggregation" operation. To start, I see two possibilities. The first is to define it a priori with a set of rules, but that amounts to hard-coding and isn't flexible. The second is to use AI methods to derive rules from the texts. Or maybe there is a third option...
It's an easy task if I have an ontology of the domain, since I can select the last common ancestor. If I don't have one, I will propose a new approach that selects the most representative keywords in the corpus.
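One possible way to pick "representative keywords" without an ontology is to rank terms by their TF-IDF weight summed over the corpus. This is only a sketch under that assumption, not necessarily the approach meant above. The function name `top_keywords`, the tiny `STOPWORDS` set, and the smoothed IDF formula are all choices made for this illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be larger.
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})


def top_keywords(corpus, k=3):
    """Return the k terms with the highest TF-IDF score summed over all documents."""
    # Lowercase, keep alphabetic tokens only, drop stopwords.
    tokenized = [
        [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]
        for doc in corpus
    ]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        tf = Counter(tokens)
        for term, count in tf.items():
            # Smoothed IDF (an assumption of this sketch); always >= 1.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(tokens)) * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


corpus = ["apple banana apple", "apple cherry", "banana apple durian"]
print(top_keywords(corpus, k=2))
```

Summing over documents favours terms that are both frequent and widespread, which is one reading of "representative". If you want terms that distinguish documents from each other instead, you would rank per document rather than over the whole corpus.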