A 'case' sample and a 'control' sample were sequenced on an Illumina platform. After bioinformatic processing, my dataset contains millions of reads, each consisting of 3 nucleotides (one codon), which I would like to test for a SNP.
In the control, only one codon was expected, AAT, but it actually occurred in 99.186% of reads (4,308,215 out of 4,343,588). The remaining 0.814% (100 - 99.186) can therefore be considered sequencing error (base miscalling). This 0.814% comprises all the other observed codons (AAG, AAA, ATA, etc.).
In the 'case' sample, AAT was the most frequent codon (86.836%), but AAC also occurred at 12.26% (197,248 out of 1,608,930 reads). In the control, AAC occurred at 0.016% (708 reads).
Therefore:
Control:
AAT 4,308,215 (99.186%)
AAC 708 (0.016%)
Total 4,343,588
Case:
AAT 1,397,138 (86.836%)
AAC 197,248 (12.260%)
Total 1,608,930
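As a sanity check on the percentages, here is a short Python sketch that recomputes them from the raw read counts. The dictionary layout and the `percentages` helper are my own illustrative names, not part of any pipeline.

```python
# Read counts as reported for the two samples.
counts = {
    "control": {"AAT": 4_308_215, "AAC": 708, "total": 4_343_588},
    "case": {"AAT": 1_397_138, "AAC": 197_248, "total": 1_608_930},
}


def percentages(sample):
    """Return each codon's share of the sample's total reads, in percent."""
    total = sample["total"]
    return {
        codon: 100 * n / total
        for codon, n in sample.items()
        if codon != "total"
    }


pct = {name: percentages(sample) for name, sample in counts.items()}

for name, shares in pct.items():
    for codon, share in shares.items():
        print(f"{name:8s} {codon}  {share:.3f}%")
```

Note that the AAT and AAC counts do not add up to the totals, because the remaining reads carry other codons.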
I would like to test whether the 12.26% AAC frequency in the case is significantly different from the 0.016% in the control.
This is just an example, and the difference here is perhaps large enough that its biological significance is clear without statistics, but the same approach would let me test several other codons.
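To make the comparison concrete, below is a minimal sketch of one candidate approach, a pooled two-proportion z-test, written with the Python standard library only. I am assuming independent reads and a normal approximation, and the function name `two_proportion_z` is just illustrative. Whether this is the right test for these data is part of what I am asking.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test (normal approximation).

    x1/n1 and x2/n2 are the observed proportions in the two samples.
    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0: p1 == p2
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# AAC reads in the case vs. AAC reads in the control.
z, p_value = two_proportion_z(197_248, 1_608_930, 708, 4_343_588)
print(f"z = {z:.1f}, two-sided p = {p_value:.3g}")
```

With millions of reads per sample, even tiny differences in proportion will come out as highly significant under a test like this, so the size of the difference relative to the background error rate may matter more than the p-value itself.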
Now, my questions are:
1. Since I would test proportions (e.g. 197,248/1,608,930 vs. 708/4,343,588), I think that data follow a binomial distribution (or Poisson, due to 0,9