To what extent can the principles of demographic statistics be applied to corpus research?

07 January 2016 2 3K Report

I am trying to apply a number of statistical measures to my corpus research into some and any. The main measures I am applying are:

1) Measures such as T-score and MI score to produce reliable collocation lists for some and any inside specific structural patterns-questions, negative sentences, conditionals etc.

There is quite a lot of literature on the application of statistical significance scores to collocations and colligations, indicating which scores are appropriate in specific research contexts; for example the MI score is thought to be inappropriate for low frequency collocates. However, I would be interested to hear from researchers who have tried to apply these measures to words. Do you feel that such measures are as applicable in linguistic studies as they are in demographic ones? Or do words require different statistical measures to populations?

2) Confidence intervals and confidence levels to calculate the ideal random sample size for "unwieldy" search results: most of the searches that I am conducting with "some" and "any", produce too many results for me to analyse the whole corpus, thus forcing me to use random samples. I am using a sample size calculator to work out the size of each random sample: I write in the total number of examples that the search produces across the whole corpus and set the confidence level at 95 % and the confidence interval-or margin of error- at 4 percent.

One researcher has suggested to me that this is inappropriate on two grounds. Firstly, "words are not people". Secondly, no corpus is representative of language in its entirety, so it is not appropriate to treat the entire set of results from a search term as the whole population. I understand both arguments but cannot think of a better way of establishing that the random samples used with each search are big enough to be representative of the whole corpus.

Peter Samuels

Dear Chris,

I advised a colleague recently in corpus linguistic statistics but I am not an expert. My advice would be:

Whilst corpuses are not the same as normal populations because the words are joined together grammatically I don't think this is a sufficient reason to say that population based statistical methods cannot be applied. I think you just need to be careful.
Re point 1), I'm not sure what you mean be "some" and "any" here - are they specific words you are searching for in a certain way? My experience with my colleague was there is a danger in the corpus linguistics literature to create ranked lists of words, e.g. by frequency, and then test for significance, which is inconsistent with the principles of hypothesis testing.
Re point 2), similar to above, I would not use confidence intervals to estimate an appropriate sample size. Sample size calculations are based upon a priori estimates of effect sizes rather than any statistic derived from the data itself. Otherwise, you could potentially increase your sample size to prove anything you wanted to.

Chris Turner

Hi Peter,

Thanks for your answer. I think it is important for statisticians and corpus linguists to understand each other, as the former can greatly help the latter to use quantitative data in a principled way.

Re point 1: I am researching the words "some" and "any" and their compound forms-"someone", "anyone" etc, because this is an area of grammar that is poorly explained in grammar books and poses a number of problems to learners of English. I am carrying out searches to see differences in the lexical meaning of "some" and "any" and in their pragmatic meaning (e.g. speaker intention, attitude, expectation etc) in different grammatical environments-inside clauses preceded by if, in questions, in negative sentences etc.

I am not testing for the significance of ranked lists of words. Instead, with all of my main searches into "some" and "any" in specific grammatical environments, I am using the collocation search function in Sketch Engine. This function enables you to generate a list of collocation candidates with difference significance measures-T-score, MI score, log likelihood etc. You can specify the number of words to the left or the right of the search term that you want to consider- this varies, on linguistic criteria,according to the grammatical structure that "some" or "any" enters into. You can also indicate a minimum frequency for the collocation candidates-both in the corpus itself and inside the specific search term.

It is important to stress that I am not using statistical criteria to prove any hypothesis. I am merely carrying out exploratory searches into "some" and "any" to discover how the two words behave. The statistical measures for collocations will simply provide an inital, automatically generated list of collocational candidates whose ultimate significance will be determined by a mixture of quantitative criteria ( e.g. a T-score of above 2 and/or an MI score of above 5) together with qualitative criteria based on my linguistic analysis. In some cases, it is possible that linguistic criteria may lead me to accept collocations with lower T-scores or MI scores-e.g. because I believe that the word forms part of a semantically determined lexical set that often appears in the context of "some" or "any".

With regard to point 2, I am not sure if you have misunderstood me or I have misunderstood you! As far as I understand the process of estimating sample sizes in demographic studies-which is not very far- it is relatively common practice to use confidence level and confidence intervals as criteria for generating a sample size that can be considered to be adequately representative of the whole population. It is, I think, also common practice to do things the other way round- to decide on a sample size according to other criteria and then calculate a posteriori the confidence level and confidence interval of your results/findings. However, the thing is that there are normally no other quantitative criteria that can be used to estimate what would be a reliable random sample size for my searches. It is true that particular searches may prove to contain more variable linguistic behaviour than others, thus making it more difficult to state with any confidence that the sample is a relatively accurate reflection of the results of a particular search across the whole corpus. . If this turns out to be the case with any of my searches, I will be forced to increase the sample size to cater for this variability. However, the question again arises of how to do this. The only way I can think of is to increase the confidence level-e.g. to 99%- and lower the confidence interval-e.g. to 2 %.

It is important here as well to bear in mind that I am not using statistics to prove hypotheseses. The statistical measures related to sample size simply provide me with a degree of statistical backing for deciding on the size of the samples, which are then subjected to qualitative linguistic analysis.

None of the above means that I am hell-bent on continuing with the statistical measures that I propose. I would welcome suggestions for other statistics-based ways of determining random sample size and collocational significance.

Which type of compound does lamda max of 218 indicate in a uv-vis spectrum of a partially purified compound through column and TLC?

Conjugation of PEG-Amine to an Amino Acid Using EDC?

Why do some of my Raman spectra show no signal at lower Raman shifts?

Could you try using PeptiCloud and see if it's a useful tool for biology research?

Are these cassettes suitable for expressing PETase mutant in E. coli?

Does soybean seed coat or cotyledon contain chlorophyl or flavonoids? What types? determined by paper chromatography? other methods (high school lab)?

Does post-translational protein modification cause devisions on observed pI verses calculated pI?

How I can add anew research in my account ?

Can Document Analysis be used as the only method of research?

How much sample we are required to do a pilot study for standardized an Achievement Test?

How to learn more about SPSS and its Application?

Can I base on reverse DNA sequences to perform alignment, convert to amino acids and GenBank submission?

Baseline drift in HPLC? What causes this?

Text-Communication from the M1 Hand Area using BCI—and then there is Elon Musk?

Handling Missing Data and Building a Predictive Model with Incomplete Information ?

Has anyone applied Python in the field of textile engineering for data analysis, automation, or smart textiles?

How can I use the cif data obtained from rietveld refinement extracted via gsas2, for microstructural analysis using ETEX software?

How are iso-frequency contours plotted?

How to prepare the nanoparticle treated fungal sample for Environmental SEM analysis?

How to normalize and take the significance of the MTT OD values with 3 replicates for the same cell-line?