Hello, I am running Random Forest and Decision Tree models and need to convert my categories into dummy variables, since the Python libraries I am using cannot handle categorical string data.

Please refer to the attached screenshot for the data table.

Of course, I have other people in my dataset as well, so I would drop one university and one degree type to avoid multicollinearity. My question: since John attended Harvard twice (for both BS and MS), is it OK to have a 2 there instead of a 1? For example, if another person, Alan, attended Harvard and MIT for BS and MS respectively, he would have a 1 for both of those universities and a 0 for all other universities. I understand that giving Harvard a 2 for John makes this variable twice as strong compared to Alan, but shouldn't that be justified, since John attended the university twice?

Or can dummy variables only take the binary values 1 or 0?
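
To make the two options concrete, here is a minimal sketch of what I mean, assuming pandas. The table and the column names (`person`, `degree`, `university`) are made up for illustration and are not the ones from my real dataset. The first result keeps the number of times each university was attended (John gets a 2 for Harvard), and the second caps every value at 1 so it is a strict 0/1 dummy.

```python
import pandas as pd

# Hypothetical long-format data: one row per (person, degree) pair.
# Names and values are illustrative only, not my real dataset.
records = pd.DataFrame({
    "person": ["John", "John", "Alan", "Alan"],
    "degree": ["BS", "MS", "BS", "MS"],
    "university": ["Harvard", "Harvard", "Harvard", "MIT"],
})

# Option A: count encoding. Each cell holds how many degrees
# the person earned at that university.
counts = pd.crosstab(records["person"], records["university"])

# Option B: strict dummy (indicator) encoding. Cap every count at 1,
# so the cell only says whether the person ever attended.
binary = counts.clip(upper=1)

print(counts)
print(binary)
```

Either table could then be joined with the other features before fitting the tree models; which of the two is appropriate is exactly what I am asking.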

Please explain with scientific references if possible, since this is for my master's thesis.

Thanks in advance.
