I am trying to apply a number of statistical measures to my corpus research into "some" and "any". The main measures I am applying are:

1) Measures such as the t-score and MI score to produce reliable collocation lists for "some" and "any" inside specific structural patterns: questions, negative sentences, conditionals, etc.

There is quite a lot of literature on the application of statistical significance scores to collocations and colligations, indicating which scores are appropriate in specific research contexts; for example, the MI score is known to inflate the importance of low-frequency collocates. However, I would be interested to hear from researchers who have tried to apply these measures to words. Do you feel that such measures are as applicable in linguistic studies as they are in demographic ones? Or do words require different statistical measures from those used for populations?
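For reference, here is a minimal sketch of the two formulas as I understand them from the collocation literature: MI compares the observed co-occurrence frequency with the frequency expected under independence, while the t-score scales the same difference by the observed frequency. The counts in the example are invented, and I have left out the span-size adjustment that some tools apply to the expected frequency.

    import math

    def collocation_scores(o, f_node, f_coll, n):
        # o: observed co-occurrences of node and collocate within the span
        # f_node: corpus frequency of the node word (e.g. "some")
        # f_coll: corpus frequency of the candidate collocate
        # n: total number of tokens in the corpus
        e = (f_node * f_coll) / n      # expected co-occurrences under independence
        mi = math.log2(o / e)          # MI: effect size; inflated when o is small
        t = (o - e) / math.sqrt(o)     # t-score: favours frequent, stable pairs
        return mi, t

    # invented counts: a collocate of "any" in a 1,000,000-token corpus
    mi, t = collocation_scores(o=120, f_node=5000, f_coll=800, n=1_000_000)
    print(f"MI = {mi:.2f}, t-score = {t:.2f}")

The sketch also shows why the literature warns against MI for rare collocates: because MI divides by the expected frequency, a pair observed only a handful of times can still receive a very high score.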

2) Confidence intervals and confidence levels to calculate the ideal random sample size for "unwieldy" search results: most of the searches that I am conducting with "some" and "any" produce too many results for me to analyse exhaustively, forcing me to use random samples. I am using a sample size calculator to work out the size of each random sample: I enter the total number of examples that the search produces across the whole corpus, set the confidence level at 95%, and set the confidence interval (or margin of error) at 4%.
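For concreteness, the calculation that such online calculators appear to perform is, as far as I can tell, Cochran's formula with a finite population correction; the sketch below reproduces it under that assumption. The function name and the 25,000-hit search are illustrative, and p = 0.5 is the conservative default most calculators assume.

    import math

    def sample_size(population, z=1.96, margin=0.04, p=0.5):
        # population: total hits the search returns across the corpus
        # z: z-value for the confidence level (1.96 for 95%)
        # margin: margin of error (0.04 = 4 percentage points)
        # p: assumed proportion; 0.5 gives the largest, safest sample
        n0 = (z ** 2) * p * (1 - p) / margin ** 2     # infinite-population size (~600)
        return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite population correction

    # e.g. a search returning 25,000 hits -> a sample of about 587 examples
    print(sample_size(25_000))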

One researcher has suggested to me that this is inappropriate on two grounds. Firstly, "words are not people". Secondly, no corpus is representative of language in its entirety, so it is not appropriate to treat the entire set of results from a search term as the whole population. I understand both arguments but cannot think of a better way of establishing that the random samples used with each search are big enough to be representative of the whole corpus.
