In psycholinguistic norming studies, 15-20 raters per word per scale seems to be the rule of thumb. However, I cannot find a psychometric explanation or justification for this. Does anyone have a reference that addresses this question?
I am old-fashioned. It depends on how reliable you want the result to be. This is like asking "how many similar items do I need" to form a reliable measuring scale (or, e.g., how many items loading how heavily are needed to form a good factor in factor analysis, etc.). Well, the minimum, I'd say, is always 3. The old rule of thumb I've always used is that you need 10 cases per variable measured (per set of conditions, etc.). Using this "rule," you would really need 20 cases to measure/estimate both a mean and a standard deviation and use both going forward. (Remember how errors can propagate through an analysis.) Actually, it is good practice, and almost always possible, to estimate the "significance" of any estimate of any parameter, usually based on the number of instances you have and the theoretical distribution you expect to see.
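To illustrate that last point, here is a minimal sketch (Python with SciPy; the rating mean, SD, and scale bounds are made-up numbers, not from any real norming study) of how the precision of an estimated mean rating depends on the number of raters, using a t-based confidence interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 7-point ratings of one word from n raters (simulated data).
for n in (5, 10, 15, 20, 40):
    ratings = rng.normal(loc=4.2, scale=1.3, size=n).clip(1, 7)
    mean = ratings.mean()
    sem = stats.sem(ratings)                                   # standard error of the mean
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
    print(f"n={n:2d}  mean={mean:.2f}  95% CI width={hi - lo:.2f}")
```

The CI width shrinks roughly with the square root of n, which is the quantitative version of "more raters, more reliable."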
Hi, Milica. There is a huge amount of literature and many tools for sample-size and power determination. Google "number of cases needed to estimate a parameter." My first hit is a good one: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Power/BS704_Power_print.html
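For the mean-estimation case that such modules cover, the standard formula is n = (z * sigma / E)^2, where sigma is the expected standard deviation of the ratings and E is the margin of error you are willing to tolerate. A minimal sketch (Python/SciPy; the sigma and margin values are assumptions you would replace with your own):

```python
import math
from scipy import stats

def n_for_mean(sigma, margin, confidence=0.95):
    """Raters needed so the mean rating falls within +/- margin of the true
    value at the given confidence level: n = (z * sigma / margin) ** 2."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    return math.ceil((z * sigma / margin) ** 2)

# E.g., a 7-point scale with an assumed rating SD of 1.5, tolerating +/- 0.75:
print(n_for_mean(sigma=1.5, margin=0.75))          # -> 16, in the 15-20 range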
The rule of thumb of 10 cases per parameter is still a pretty good one though, I think. Here is a link that mentions it: https://www.statisticssolutions.com/sample-size-formula/
Incidentally, 15-20 raters "per word per scale" would seem to require a very large number of raters ... and a complicated inter-rater reliability analysis, especially if each rater rated multiple words and you needed to deal with within-rater variability as well. There are lots of human internet rater services. Maybe you are using one. Good ones probably use advanced analytics to pool rater results, allowing for variability in rater agreement and other factors.
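On the pooling point, one common first check in norming work is to treat each rater as an "item" and compute Cronbach's alpha (or an ICC) over the words-by-raters matrix. A minimal sketch with NumPy on made-up data, assuming every rater rated every word (real crowd-sourced designs with missing cells usually call for an ICC or a mixed model instead):

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a words x raters matrix (raters treated as items)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                            # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)        # variance of each rater's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of the summed ratings
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Simulated example: 50 words rated by 15 raters on a 7-point scale.
rng = np.random.default_rng(1)
true_scores = rng.uniform(1, 7, size=(50, 1))       # latent word norms
noise = rng.normal(0, 1.0, size=(50, 15))           # per-rater noise
ratings = np.clip(true_scores + noise, 1, 7)
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```

If alpha (or the ICC) is already high with, say, 15 raters, adding more raters buys little extra reliability, which is roughly the logic behind the 15-20 rule of thumb.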