I have been analysing examination data from students and have been tasked with identifying whether any of the examiners were unduly harsh or lenient. The problem I've been having is that the design of the exam is poorly suited to identifying harsh examiners. Students proceed around 10 stations, each of which has a different examiner providing the only grade for that station. Furthermore, no one examiner saw all of the students: different examiners were assigned to the same station at different points during the exam. This leaves us in the situation where the variation in the students' performance was due to:

  • Changes in student ability
  • Changes in the station they're on
  • Changes in the examiner they're being assessed by

These different types of variance are all intermixed, and I don't see an easy way of separating them. Someone mentioned Rasch analysis to me, although I struggled to find a user-friendly tutorial for it. As far as I can tell, it models the student's ability as a latent trait, and I'm not sure how that would help me with the problem of identifying harsh examiners.

What I have done for this year's class was a compromise. I calculated the mean score across all stations awarded by each individual examiner and then converted these means to z-scores, so that I could identify which examiners tended to issue harsher or more lenient scores. In one case I found an examiner whose average awarded score was 4 standard deviations below the overall mean. However, on closer examination, the specific students they had assessed turned out to be quite poor in other areas as well.
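
For illustration, the calculation I did amounts to something like the following Python/pandas sketch (the data frame and column names here are invented; my real data has one row per student-station score, with the examiner who awarded it recorded):

```python
import pandas as pd

# Hypothetical long-format data: one row per (student, station) score,
# with the examiner who awarded it. Names and values are made up.
df = pd.DataFrame({
    "examiner": ["A", "A", "B", "B", "C", "C"],
    "score":    [62,  70,  55,  58,  71,  66],
})

# Mean score awarded by each examiner across all stations they marked
examiner_means = df.groupby("examiner")["score"].mean()

# Convert those means to z-scores relative to the other examiners,
# so a large negative value flags a potentially harsh examiner
z = (examiner_means - examiner_means.mean()) / examiner_means.std()
print(z.sort_values())
```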

For next year's exam I was considering a different approach. I was thinking of creating 10 regression models, each using one station's score as the outcome variable and all the other stations' scores as the predictors. That is, for each station I would model what the student's score should have been based on their performance at the other stations. I would then calculate standardised residuals for each student at each station and see how far their performance on any one station is from what their overall performance says it should have been (see the sketch below). Because I'll only be modelling my own specific dataset, and I don't care about generalisability, I'm assuming most of the usual assumptions you worry about with linear regression won't apply. With this approach I would simply make a list of the examiners involved in scores that appear "extreme" relative to the rest of that student's performance. If the same examiners keep appearing on the list, it is a fair bet that they are out of step with the other examiners in how they grade. Any one extreme score could simply be the result of a student doing poorly on an individual station, but a pattern of extreme scores attributed to a single examiner would be suggestive of a problem with their marking.
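
Roughly, I picture the mechanics looking something like this Python sketch (the data is simulated just to make it runnable, and the 2.5 SD cut-off and the crude standardisation of the residuals are placeholders I would need to think harder about):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical wide-format data: one row per student, one column per station
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.normal(60, 10, size=(200, 10)),
                      columns=[f"station_{i}" for i in range(1, 11)])

std_resid = pd.DataFrame(index=scores.index, columns=scores.columns,
                         dtype=float)

for station in scores.columns:
    X = scores.drop(columns=station)   # the other nine stations as predictors
    y = scores[station]                # the station being modelled
    pred = LinearRegression().fit(X, y).predict(X)
    resid = y - pred
    std_resid[station] = resid / resid.std()  # crude standardisation

# Flag (student, station) cells far from what the other stations predict;
# each flagged cell would then be matched back to the examiner who marked it.
# With this simulated noise there may be few or no flags.
flags = std_resid.abs() > 2.5
print(std_resid[flags].stack())
```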

I was concerned that I might be reinventing the wheel here and that there may already be codified ways of dealing with this problem (for example, the aforementioned Rasch modelling or some sort of linear mixed-effects (LME) model). I'm fairly new to modelling approaches any more sophisticated than ordinary regression. Accordingly, I wanted to pick the brains of the experts on here.
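
From my limited reading, I think the LME idea would treat student, station, and examiner as crossed random effects, so that the fitted examiner intercepts act as severity/leniency estimates. Below is a guess at what that specification might look like (using statsmodels' variance-components interface for crossed effects; the data is simulated and I may well have the specification wrong):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per (student, station) score
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "student":  rng.integers(0, 40, n).astype(str),
    "station":  rng.integers(0, 10, n).astype(str),
    "examiner": rng.integers(0, 8, n).astype(str),
})
df["score"] = rng.normal(60, 10, n)
df["group"] = 1  # single dummy group so all three factors are crossed

# Student, station, and examiner as crossed random intercepts,
# fitted via the variance-components formula interface
vcf = {
    "student":  "0 + C(student)",
    "station":  "0 + C(station)",
    "examiner": "0 + C(examiner)",
}
model = smf.mixedlm("score ~ 1", df, groups="group",
                    re_formula="0", vc_formula=vcf)
result = model.fit()

# The fitted examiner intercepts would be the severity/leniency estimates
re = result.random_effects[1]
print(re[re.index.str.contains("examiner")].sort_values())
```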

What do you think:

Is what I'm proposing a good way of dealing with this problem (without redesigning the exam, of course!)?

Is there a better way, or a pre-existing way, that achieves the same thing?

Thank you for any help you can provide.
