My question concerns acceptability ratings in linguistics. Following the advice in, among others, Cowart (1997) and Bross (2019), I included filler sentences in my questionnaires. Some of the fillers are so-called benchmark sentences: either completely ill-formed sentences (which should be rated as unacceptable) or perfectly grammatical ones (which should be rated as fully acceptable). I use them to check whether the participants are paying attention and whether their answers can be relied upon.
While visually exploring my datasets, I noticed that a few participants did not pass the "benchmark test." I will probably have to remove some sets of answers from my dataset, but how many? If, let's say, I had five benchmark sentences (each either very ill-formed or perfectly acceptable), how many of them need to be rated in line with my expectations for a submission to count as reliable? What are your experiences with this?
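To make the question concrete, here is a minimal sketch of the kind of check I have in mind. It is not an established procedure: the 7-point scale, the cut-offs for a "hit," the labels, and the ratings are all made-up assumptions. It counts how many benchmark sentences a participant rated on the expected side of the scale and, for comparison, computes how often a participant answering purely at random would reach a given number of hits.

```python
from math import comb

# Hypothetical setup: 7-point scale, 1 = unacceptable, 7 = fully acceptable.
LOW_MAX = 3   # assumed: a "bad" benchmark counts as a hit if rated <= 3
HIGH_MIN = 5  # assumed: a "good" benchmark counts as a hit if rated >= 5


def benchmark_hits(ratings, expected):
    """Count benchmark items rated on the expected side of the scale.

    ratings  -- list of ratings, one per benchmark sentence
    expected -- list of "bad" / "good" labels in the same order
    """
    hits = 0
    for rating, label in zip(ratings, expected):
        if label == "bad" and rating <= LOW_MAX:
            hits += 1
        elif label == "good" and rating >= HIGH_MIN:
            hits += 1
    return hits


def chance_of_passing(n_items, min_hits, p_hit):
    """Probability that a random responder gets at least min_hits of n_items."""
    return sum(
        comb(n_items, k) * p_hit**k * (1 - p_hit) ** (n_items - k)
        for k in range(min_hits, n_items + 1)
    )


expected = ["bad", "bad", "good", "good", "bad"]
participant = [1, 2, 7, 6, 5]  # made-up ratings; the last benchmark is missed

hits = benchmark_hits(participant, expected)
# With uniform random answers, 3 of the 7 scale points count as a hit: p = 3/7.
p_random = chance_of_passing(n_items=5, min_hits=4, p_hit=3 / 7)
print(hits, round(p_random, 3))
```

In this made-up example the participant hits 4 of the 5 benchmarks, and under the stated assumptions a participant answering at random would reach 4 or more hits about 11% of the time. Whether that makes "4 out of 5" a sensible cut-off is exactly what I am unsure about.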
I will be grateful for answers, suggestions, and references.