I have a dataset covering 10 years of a large-scale assessment in Brazil. It contains students' characteristics and scores for each year, as well as several school characteristics.

I'm working at the school level, so I aggregated all student characteristics, including scores, for each school. I framed the problem as a binary classification: my target (derived from the average school score) is 1 for schools in the upper quartile and 0 otherwise.
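A minimal sketch of this aggregation step, assuming pandas and hypothetical column names (`school_id`, `year`, `score`, `parent_edu`). Computing the quartile cutoff separately within each year is also an assumption; a single pooled cutoff would be another option.

```python
import pandas as pd

# Hypothetical student-level records; column names are illustrative only.
students = pd.DataFrame({
    "school_id":  [1, 1, 2, 2, 3, 3, 4, 4],
    "year":       [2010] * 8,
    "score":      [200.0, 220.0, 250.0, 270.0, 180.0, 190.0, 300.0, 320.0],
    "parent_edu": [1, 2, 3, 3, 1, 1, 4, 4],
})

# Collapse to one row per school and year by averaging the student values.
schools = (
    students.groupby(["school_id", "year"], as_index=False)[["score", "parent_edu"]]
    .mean()
)

# Upper-quartile cutoff of the school means, computed within each year (assumption).
cutoff = schools.groupby("year")["score"].transform(lambda s: s.quantile(0.75))

# Target: 1 for schools above the cutoff, 0 otherwise.
schools["target"] = (schools["score"] > cutoff).astype(int)
```

In this toy data only school 4 ends up above the 75th percentile of school means.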

Taking advantage of the panel structure at the school and Brazilian state levels, I plan to control for unobservables, which I assume are present in two forms:

1) Students' ability, motivation, IQ... (assumed time-invariant on average within schools; I plan to eliminate these)

2) State policies (I plan to estimate their influence on school gains from the residual effect of each state relative to the reference state, i.e. the left-out dummy)

Then, to eliminate the bias from 1), I subtracted the school mean (computed within each school) from all features, including the target, which I binarized afterwards. Now I have a vector of school-demeaned features as independent variables and a binary target, which now means: 1 if the deviation from the school mean is in the upper quartile and 0 otherwise.
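A minimal sketch of the within-school demeaning and the subsequent binarization, assuming pandas and hypothetical column names (`score`, `class_size`). The cutoff here is the upper quartile of all deviations pooled across schools and years, which is an assumption about how the binarization is done.

```python
import pandas as pd

# Hypothetical school-year panel; column names and values are illustrative only.
panel = pd.DataFrame({
    "school_id":  [1, 1, 1, 1, 2, 2, 2, 2],
    "year":       [2010, 2011, 2012, 2013] * 2,
    "score":      [200.0, 210.0, 220.0, 250.0, 300.0, 290.0, 310.0, 300.0],
    "class_size": [30.0, 28.0, 26.0, 24.0, 20.0, 22.0, 20.0, 18.0],
})

cols = ["score", "class_size"]

# Within transformation: subtract each school's own mean over time.
demeaned = panel[cols] - panel.groupby("school_id")[cols].transform("mean")
panel = pd.concat([panel, demeaned.add_suffix("_dm")], axis=1)

# Binarize the demeaned score: 1 if the deviation from the school mean
# is in the upper quartile of all deviations (pooled cutoff, an assumption).
cutoff = panel["score_dm"].quantile(0.75)
panel["target"] = (panel["score_dm"] > cutoff).astype(int)
```

After this step every demeaned column averages to zero within each school, so any time-invariant school-level component has been removed from the features.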

Now I need to make inferences about 2) as a fixed effect (one effect per state for all periods). Therefore, I simply add dummy variables for the states. This way, the time-variant effect of each state relative to the reference state over the period will be estimated, while the time-invariant biases will be eliminated.

Finally, I fit this final model.
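A minimal sketch of such a final model, assuming scikit-learn, synthetic data, and hypothetical names (`class_size_dm`, `teacher_exp_dm`, three example states). Note that `LogisticRegression` is penalized by default, so a large `C` is used here to make the penalty negligible, and it reports no standard errors; a package that reports them would be needed for inference on the state coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic school-year rows; all names and effect sizes are made up for illustration.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "state": rng.choice(["SP", "RJ", "MG"], size=n),
    "class_size_dm": rng.normal(size=n),    # school-demeaned feature
    "teacher_exp_dm": rng.normal(size=n),   # school-demeaned feature
})
index = (
    0.8 * df["teacher_exp_dm"]
    - 0.5 * df["class_size_dm"]
    + 0.7 * (df["state"] == "SP")
    - 1.0
)
df["target"] = (rng.random(n) < 1 / (1 + np.exp(-index))).astype(int)

# State dummies; drop_first=True leaves out the alphabetically first
# state ("MG" here) as the reference category.
state_dummies = pd.get_dummies(df["state"], prefix="state", drop_first=True, dtype=float)
X = pd.concat([df[["class_size_dm", "teacher_exp_dm"]], state_dummies], axis=1)

# Large C makes the default L2 penalty negligible (approximately unpenalized).
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, df["target"])

# Each state coefficient is a log-odds shift relative to the left-out state.
coefs = pd.Series(model.coef_[0], index=X.columns)
```

The entries `coefs["state_RJ"]` and `coefs["state_SP"]` are then read as differences in log-odds relative to the reference state, holding the demeaned features fixed.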

Is my approach correct under assumptions 1) and 2), in addition to those of the logistic model?

How far am I from causal results?
