I have a dataset composed of several subjects. Each subject has a series of binary indicators, where 1 means that the subject presents the indicator and 0 means that the indicator is not present. These indicators are grouped into 5 categories, each composed of a different number of indicators. The more indicators a subject has present in a category, the stronger that category is in that subject.
I want to aggregate all the binary indicators in each category into a single real value for each subject. The method that came to mind is to divide the number of indicators with a value of 1 by the total number of indicators in the category. That way, a subject with all the indicators in a category equal to 1 gets the maximum value of 1.0 in that category; likewise, a subject with half the indicators equal to 1 and the other half equal to 0 gets an aggregate value of 0.5 in that category. I am new to data science and I am not sure if this is the best approach. What do you think? Does this aggregation make sense to you? Do you know of any other possible aggregations?
Below I attach a sample toy dataset with 3 subjects, 2 categories and 2 indicators per category to further explain my problem:
|          | Indicator1.1 | Indicator1.2 | Indicator2.1 | Indicator2.2 |
|----------|--------------|--------------|--------------|--------------|
| Subject1 | 1            | 0            | 1            | 1            |
| Subject2 | 1            | 1            | 0            | 0            |
| Subject3 | 0            | 1            | 1            | 1            |
In the example above, Indicator1.1 and Indicator1.2 belong to the same category; likewise, Indicator2.1 and Indicator2.2 belong to the same category. With the ratio aggregation described above, the real-valued category scores will be:
|          | Category 1 | Category 2 |
|----------|------------|------------|
| Subject1 | 0.5        | 1.0        |
| Subject2 | 1.0        | 0.0        |
| Subject3 | 0.5        | 1.0        |
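
To make the ratio aggregation concrete, here is a minimal sketch in plain Python that reproduces the category values for the toy dataset. The dictionary layout and the names `data`, `categories` and `category_ratios` are illustrative assumptions, not part of any library.

```python
# Toy dataset from the question: subject -> {indicator name: 0 or 1}
data = {
    "Subject1": {"Indicator1.1": 1, "Indicator1.2": 0, "Indicator2.1": 1, "Indicator2.2": 1},
    "Subject2": {"Indicator1.1": 1, "Indicator1.2": 1, "Indicator2.1": 0, "Indicator2.2": 0},
    "Subject3": {"Indicator1.1": 0, "Indicator1.2": 1, "Indicator2.1": 1, "Indicator2.2": 1},
}

# Assumed mapping from each category to the indicators it contains.
# Categories may hold different numbers of indicators.
categories = {
    "Category 1": ["Indicator1.1", "Indicator1.2"],
    "Category 2": ["Indicator2.1", "Indicator2.2"],
}


def category_ratios(indicators, categories):
    """Return, per category, the fraction of its indicators equal to 1."""
    return {
        category: sum(indicators[name] for name in names) / len(names)
        for category, names in categories.items()
    }


# One aggregated row per subject.
scores = {
    subject: category_ratios(indicators, categories)
    for subject, indicators in data.items()
}

for subject, row in scores.items():
    print(subject, row)
```

Because each sum is divided by the size of its own category, the scores stay in the range 0 to 1 even when the categories contain different numbers of indicators.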