I am searching for a study that examined how many annotators are needed to create a reliable corpus for evaluating a text classification task.

Snow et al. [1] argue that, on average, four non-expert raters are required for annotation tasks, but the tasks they describe are not classification tasks (only the study on affective data might be considered a classification task). I am rather looking for a statement regarding topic-based classification.

Often, three annotators are used and a majority vote is taken, but without real evidence that this is a sufficient number...
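To make the setup I have in mind concrete, here is a minimal sketch of such a three-annotator majority vote; the documents, topic labels, and the tie-handling rule are purely hypothetical examples, not taken from any particular study:

```python
from collections import Counter

# Hypothetical topic labels assigned by three annotators per document.
annotations = [
    ["sports", "sports", "politics"],
    ["politics", "politics", "politics"],
    ["tech", "sports", "tech"],
    ["tech", "sports", "politics"],   # no majority -> left unresolved
]

def majority_label(labels):
    """Return the label chosen by a strict majority, or None on a tie."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

gold = [majority_label(doc) for doc in annotations]
print(gold)  # ['sports', 'politics', 'tech', None]
```

My question is whether there is empirical evidence that three raters aggregated this way yield labels reliable enough for evaluation, or whether more raters (or an agreement measure such as Fleiss' kappa as a quality check) are needed.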

Thank you very much in advance for your answers!

[1] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 254-263. 
