Statistical independence test only for Observed>Expected in contingency tables of 9 cells?

03 March 2018

Hi again.

I have the following problem:

Let's say I have two discrete random variables X and Y, each with three possible values: a, b and c.

I would like to test whether, in a sample of 500 objects, the two variables X and Y are independent in a special way:

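H_0: Pr(X=x ^ Y=y) <= Pr(X=x) Pr(Y=y).

For a contingency table T, I use a score s(T) that adds up the chi-squared deviations only for the cells where Observed_ij > Expected_ij.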

To calculate p-values, I estimate the distribution of s(T') over a million random tables T', each with the same marginal frequencies as T.

To do so, I generate a random permutation of the values of X over the 500 objects and another random vector with the same value frequencies as Y, then calculate the contingency table T' and s(T'). I repeat this for a million tables T'.

Then, the distribution of s(T') is used to calculate p-values.
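
As a rough illustration of the procedure described above, here is a minimal Python sketch. It assumes that the values a, b, c are coded as integers 0, 1, 2 and that the "deviation" of a cell is the usual chi-squared contribution (O - E)^2 / E, counted only where O > E. The function names (`contingency`, `one_sided_score`, `permutation_p_value`), the `cells` parameter and the simulated data are made up for the example; they are not from any library.

```python
import numpy as np

def contingency(x, y, k=3):
    """k-by-k table of counts for integer-coded vectors x and y (values 0..k-1)."""
    table = np.zeros((k, k), dtype=int)
    np.add.at(table, (x, y), 1)
    return table

def one_sided_score(table, cells=None):
    """Sum of (O - E)^2 / E over the chosen cells, counting only cells with O > E.

    Assumption: E is the usual product of margins divided by n.
    If cells is None, all cells of the table are used.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    excess = np.clip(table - expected, 0.0, None)  # zero where O <= E
    with np.errstate(divide="ignore", invalid="ignore"):
        contrib = np.where(expected > 0, excess**2 / expected, 0.0)
    if cells is None:
        return float(contrib.sum())
    return float(sum(contrib[i, j] for i, j in cells))

def permutation_p_value(x, y, n_perm=10_000, cells=None, seed=0):
    """Monte Carlo p-value of the score under independence (H'_0)."""
    rng = np.random.default_rng(seed)
    observed = one_sided_score(contingency(x, y), cells)
    hits = 0
    for _ in range(n_perm):
        x_perm = rng.permutation(x)  # shuffling keeps both margins fixed
        if one_sided_score(contingency(x_perm, y), cells) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# Simulated data only, to show the call; not real observations.
rng = np.random.default_rng(1)
x = rng.integers(0, 3, size=500)
y = rng.integers(0, 3, size=500)
score, p_value = permutation_p_value(x, y, n_perm=2000)
print(score, p_value)
```

Permuting only X already produces tables with the same margins as T, so permuting Y as well should not change the distribution. Note that the p-value here is with respect to the independence hypothesis H'_0.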

But I realize that the null hypothesis H'_0 of the Monte Carlo procedure is that X and Y are independent, which is not exactly what I need.

Fortunately, for a fixed r, the number of tables T' with s(T') >= r under H'_0: Pr(X=x ^ Y=y) = Pr(X=x) Pr(Y=y) is greater than the number of tables T' with s(T') >= r under H_0: Pr(X=x ^ Y=y) <= Pr(X=x) Pr(Y=y). That is, Pr(s(T') >= r | H'_0) >= Pr(s(T') >= r | H_0).

So, if I have a table T and I calculate s(T) = r, I know that if Pr(s(T') >= r | H'_0) is below a given significance threshold, then Pr(s(T') >= r | H_0) is below it too.

Miranda Mortlock

Secondly, in a chi-square test for independence there are some assumptions about the data points. Are they independently collected, and does each individual occur only once in your table?

Why not start with a test for independence and see what you get?

I like to start with a clearly defined research question (in plain English) as this helps decide on the appropriate analysis.

Jaume Sastre Tomas

Hi Miranda,

Thanks for your interest.

The research problem is the following:

A SNP is a random variable that can be somatic (s), germline (g) or nothing (n), according to its mutation in each subject. These are the three values. And we have 500 subjects.

We study the association between two SNPs and want to find SNP pairs where both are somatic, or one is somatic and the other is germline, in a significantly large subset of subjects.

This means that we try to find SNP pairs where the numbers of subjects in the cells (s,s), (s,g) and (g,s) are above the expected counts. If the observed counts are below the expected ones, that is not important at all (it shows dependency, but not the kind we need).

So, we add up the chi-squared deviations of observed from expected for these three cells only (of the 9), and only where the observed count is above the expected one, and call this score s(T).
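
Written out, one possible reading of this score is the following. It assumes that the "deviations" are the usual per-cell chi-squared contributions (O - E)^2 / E; the symbols O_{xy}, E_{xy}, n_{x.}, n_{.y} and C are introduced here only for illustration.

```latex
\[
E_{xy} = \frac{n_{x\cdot}\, n_{\cdot y}}{n}, \qquad
s(T) = \sum_{(x,y) \in C} \frac{\bigl(\max(0,\; O_{xy} - E_{xy})\bigr)^{2}}{E_{xy}}, \qquad
C = \{(\mathrm{s},\mathrm{s}),\ (\mathrm{s},\mathrm{g}),\ (\mathrm{g},\mathrm{s})\}
\]
```

For example, with n = 500, a cell whose row total is 100 and whose column total is 50 has E = 100 * 50 / 500 = 10. An observed count of 16 would contribute (16 - 10)^2 / 10 = 3.6 to s(T), while an observed count of 7 would contribute 0.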

But then we do not know the distribution of this score s(T).

So we use a Monte Carlo procedure to get the distribution of s(T). But the usual way of generating the random permutations assumes that the SNPs are random and therefore independent, not that Observed < Expected.

So this is the source of the question.

Basically, we think that the distribution we have for our score s(T) is good enough to get the p-values. Everybody does it like that: a distance or score is calculated on millions of random samples, which shows how easy it is to get a value as extreme as s(T_0) or greater for a fixed T_0. But we would like to be sure.

Thanks.
