In the attached file, I have something like a contingency table, but in every group there are two biological replicates instead of one. How can a chi-square test be used in this situation, and if it can't be used, what test do you suggest?
The traditional test of association for a stratified contingency table is the Cochran–Mantel–Haenszel test. In my opinion, there are a few drawbacks with this test. First, depending on the software implementation, it can be somewhat tricky to be sure your three-way contingency table is set up correctly to test the effect you want. Second, the test assumes homogeneity of odds ratios across strata, though this is easy to test. Finally, there are various tests whose names include some combination of Cochran, Mantel, and Haenszel, and in different software it can be difficult to know which test is which. (You think I’m being silly, but I’m not.)
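To make the table set-up concern concrete, here is a minimal sketch in R of how a three-way table might be laid out for `mantelhaen.test()`. The counts and the names `Treatment`, `Outcome`, and `Stratum` are invented for illustration and are not the data from the question. Printing the array with `ftable()` is one way to check that each count landed in the cell you intended, and the per-stratum odds ratios give only an informal look at homogeneity, not a formal test.

```r
# Hypothetical counts, invented for illustration only.
# Dimensions: Treatment (rows) x Outcome (columns) x Stratum (layers).
counts <- array(
  c(20, 10, 15, 25,    # Stratum A, filled column by column
    18, 12, 14, 26),   # Stratum B, filled column by column
  dim = c(2, 2, 2),
  dimnames = list(
    Treatment = c("Treated", "Control"),
    Outcome   = c("Yes", "No"),
    Stratum   = c("A", "B")
  )
)

# Print as a flat table to confirm each count sits in the intended cell
print(ftable(counts))

# Odds ratio within each stratum, as an informal look at homogeneity
odds_ratios <- apply(counts, 3, function(m) {
  (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
})
print(odds_ratios)

# Cochran-Mantel-Haenszel test of Treatment x Outcome, controlling for Stratum
cmh <- mantelhaen.test(counts)
print(cmh)
```

The third dimension of the array is treated as the stratifying variable, so swapping the order of the dimensions changes which effect is being tested.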
A more flexible approach is to use logistic regression. For this approach, you’ll need to make one variable explicitly the dependent variable. If the dependent variable has only two levels, then the logistic regression is pretty straightforward. For a stratified sample, you can think of the model as analogous to an ANOVA with blocks. You can get a p-value for the effect of interest and one for the blocking variable. You can also add the interaction of these into the model.
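As a rough sketch of the blocked logistic regression described above, the following R code fits a binomial `glm()` to aggregated counts. The data frame and the names `Treatment`, `Block`, `Yes`, and `No` are hypothetical, invented only to show the shape of the model.

```r
# Hypothetical counts, invented for illustration only.
dat <- data.frame(
  Treatment = factor(c("Treated", "Control", "Treated", "Control")),
  Block     = factor(c("A", "A", "B", "B")),
  Yes       = c(20, 10, 18, 12),
  No        = c(15, 25, 14, 26)
)

# Two-column response: counts of "Yes" and "No" for each row
model <- glm(cbind(Yes, No) ~ Treatment + Block,
             family = binomial, data = dat)

# Likelihood-ratio tests, with terms added sequentially in formula order
lr_tests <- anova(model, test = "Chisq")
print(lr_tests)

# Odds ratios relative to the reference levels ("Control" and block "A")
print(exp(coef(model)))

# The interaction can be added, but with one row per Treatment-by-Block
# cell the model is saturated and has no residual degrees of freedom
model_int <- glm(cbind(Yes, No) ~ Treatment * Block,
                 family = binomial, data = dat)
print(df.residual(model_int))
```

Because `anova()` tests terms sequentially, the order of terms in the formula matters when the design is unbalanced.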
All that being said, to be sure any answer is appropriate for your situation, you'd have to explain the experimental setup with more specificity. For example, Lee Curley thought you had some kind of repeated measures or paired situation, which I didn't even consider... It's not clear, to me anyway, what is meant by "biological replicate" in your question.
Hi Mohamad, I would recommend the McNemar test. It is basically a chi-square test for replicated (paired) data, although that is an oversimplification. Here is a link to a YouTube video explaining the test and how to run it in SPSS: https://www.bing.com/videos/search?q=mcnemar+test&view=detail&mid=58BDF4904BC0E520B28E58BDF4904BC0E520B28E&FORM=VIRE
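For readers working in R rather than SPSS, a minimal sketch of the McNemar test might look like the following. The counts are invented for illustration, and the layout assumes each unit was classified twice (here labeled `Before` and `After`), which is the paired situation the test is designed for.

```r
# Hypothetical paired counts, invented for illustration only.
# Each of the 100 units is classified twice (before and after).
paired <- matrix(
  c(30, 12,
     5, 53),
  nrow = 2, byrow = TRUE,
  dimnames = list(Before = c("Yes", "No"), After = c("Yes", "No"))
)

# McNemar's test uses only the discordant pairs (12 and 5 here)
mc <- mcnemar.test(paired)
print(mc)
```

The test only makes sense if the rows and columns refer to the same units measured twice; for independent samples it does not apply.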
It depends what question(s) you'd like to answer with these data. Not knowing that (or the variables involved), it's hard to give you a useful response. Perhaps there's someone local that you could converse with in greater detail to help sort out both the nature of your questions and of your data.
Chi-square contingency table analysis presumes: (a) the data are frequencies (counts, not measured scores); and (b) the cells are mutually exclusive (a case appears in one and only one cell); and (c) you have sufficiently large expected cell frequencies in each cell (to have confidence that the computed chi-square follows the theoretical distribution of chi-square well). The second condition (b) rules out replications. That doesn't mean, however, that you couldn't reconfigure the data to conform to this condition.
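Condition (c) can be checked directly in R, since `chisq.test()` returns the expected counts alongside the test result. This is only a sketch with invented counts and hypothetical names (`Group`, `Outcome`); the threshold of 5 is a common rule of thumb rather than a hard requirement.

```r
# Hypothetical counts, invented for illustration only.
# Each case appears in exactly one cell, per condition (b).
tab <- matrix(
  c(40, 10,
    25, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(Group = c("G1", "G2"), Outcome = c("Yes", "No"))
)

res <- chisq.test(tab)

# Expected counts under independence; a rule of thumb wants these >= 5
print(res$expected)
print(res)

# Fisher's exact test is a common fallback when expected counts are small
print(fisher.test(tab))
```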
Thank you all for your answers. I didn't explain because it is a little complicated, but basically this is RNA-seq data comparing two conditions (two replicates for each). The first row is the number of matches at a specific site and the second row is the number of mismatches at the same site.
You'd still have to explain more about what is meant by replicate. Is replicate 1 in bile somehow the same replicate as replicate 1 in lb? Or would it be just as good to call them rep1, rep2, rep3, rep4?
Thank you so much for your follow-up. Mismatches can happen randomly because of low accuracy, but they can also be a result of "base modification". The hypothesis is that the difference between the two conditions is real and not due to low accuracy. More replicates would be better (at least three), but for reasons beyond my control, I had to settle for two. Replicates are independent. They are from four bacterial cultures grown in two conditions (two with and two without bile).
So, Rep1 Bile isn't in any way the same as Rep1 Without. They are from separate dishes. If that's the case, my recommendation would be to label them Dish1, Dish2, Dish3, Dish4 to avoid confusion. And then you can pretty much ignore Rep. If this is the case, the CMH test I mentioned won't be applicable, but a logistic regression will be. Each dish is a separate observation, but you don't need Rep or Dish in your model. Later I'll try to add what I get for results below.
Below is my approach to the problem. The code is in R, and can be run in R or at https://rdrr.io/snippets/ . Unfortunately, ResearchGate loses some of the formatting. The results here are indicated in comments with #. In the logistic regression output, the useful information includes the p-value for Condition, the proportions and their confidence intervals, and the odds ratio. Below that output, there is just a summary table of counts and proportions for reference.