Please specify what you mean by comparison in the case of gene sequence. If you can express gene sequence on a ratio scale and want to know whether the gene sequences of diseased and non-diseased members of a species differ significantly, analysis of means can be suggested. For gene sequence by type of disease, ANOVA may be helpful. There are a number of tests for such comparisons, including the kappa coefficient; however, the choice of statistical method depends on the hypothesis to be tested and the data type.
Is it for the comparison of proportions of allele occurrences in individuals having different diseases? If so, you have (allele) frequencies, and for each disease you have a frequency distribution. A simple way to statistically compare frequency distributions is the chi-squared test. It gives a p-value telling you how likely the observed, or a more extreme, deviation between the distributions would be under the assumption that all the frequencies were sampled from the same distribution. Better, though, would be a more quantitative analysis with a binomial or multinomial logistic regression (binomial if you have just 2 different alleles, multinomial otherwise). This is usually done with generalized linear models.
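As a rough illustration of the chi-squared comparison described above, here is a minimal Python sketch for the simplest case: two alleles counted in cases and in controls (a 2x2 table). The counts are invented for illustration, the function name `chi2_2x2` is hypothetical, and no continuity correction is applied. With one degree of freedom the p-value can be computed from the standard library alone.

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared test (no continuity correction) on a 2x2 count table.

    table = [[a, b], [c, d]], e.g. rows = cases/controls, columns = allele A/allele a.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if allele frequency were the same in both groups.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical allele counts: [allele A, allele a] in cases, then in controls.
allele_counts = [[60, 40], [45, 55]]
stat, p_value = chi2_2x2(allele_counts)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

For more than two alleles or more than two disease groups the table has more degrees of freedom, and a statistics library (or the regression approach mentioned above) would be the more practical route.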
When you say you want to compare the gene sequence of diseased vs. normal, are you talking about comparing read density from an RNA-seq experiment, probe expression levels from a microarray, SNPs, or nucleotide conservation? You need to explain a little more about what you are trying to do.
The answers others have given are to-the-point but I THINK you are asking a more global question. In fact, this is the major question facing human disease genetics. As the answers above indicate, there are comparisons for specific techniques (RNAseq, association) and for comparing very stable exons. But non-coding sequences? Not yet.
The problem is, simply comparing "sequence" is not likely to be fruitful because at that level it is probably structure that makes a difference and we don't understand that yet.
I disagree! I think it depends on the approach one uses to compare non-coding DNA sequences. Searching for conservation without taking into account evolutionary inversions of DNA patterns is probably one of the reasons why it remains a challenging problem. Next-gen sequencing technology does, however, have the sensitivity to compare non-coding DNA sequences across conditions. For instance, you can now look at alternative promoter usage using typical RNA-seq pipelines, as well as putative distal and proximal enhancers based on read densities in non-coding portions of the genome. I wouldn't personally go so far as to say that it is unlikely to be fruitful. We are still at an early stage of understanding "sequence", without which we are unlikely to understand "structure".
As far as I can understand from your question, you are trying to find out whether the diseased sample differs statistically significantly from the non-diseased sample.
I know little about how gene sequencing is done, but as far as comparing two samples is concerned, I would recommend an independent-samples t-test. I don't think ANOVA and chi-square can be used in this case.
Or maybe you need to provide a bit more input so that more appropriate suggestions can be made.
Mohsin, you say "Next-gen sequencing technology does however, have the sensitivity to compare non-coding DNA sequences across conditions." No, "sequencing" is neither sensitive nor insensitive. It is the analysis of the sequence that counts. Even if the sequence is 100% accurate (which, right now, it very often isn't), it is unclear whether changes in the sequence are going to be consistent in controls and different enough from cases (assuming one has even identified the relevant section(s) of DNA) that one can spot the differences based strictly on statistics. This is the problem I struggle with now. What might work for exons with their high evolutionary conservation may well not work for non-coding regions.
David, I am not claiming that next-gen sequencing is perfect, nor am I suggesting that the analysis of the sequence is unimportant. What I am saying is that the aforementioned technology can "help" identify non-coding regions of the genome that "might" be active/repressed in a specific tissue of interest. This especially holds true when you combine it with epigenetic analysis using ChIP-Seq, which may provide clues as to the conformational states the chromatin in that tissue is likely to be in. Concordance of the two should give insight into likely candidate non-coding regions. Of course, statistics alone cannot provide definite answers, but they can provide a measure of likelihood. This is why it is important to use biological and technical replicates to measure the extent of variance. Ultimately, wet lab work should validate at least some of these predictions, which collectively should make it possible to better understand sequence. With regard to evolutionary conservation, it depends on what you mean by conservation. Just because exons are conserved in the traditional sense does not necessarily mean that non-coding DNA should follow the same pattern. In some cases, non-coding DNA has in fact been shown to follow the same pattern of conservation as exons. However, in other cases, it might not be so straightforward because of the possibility of evolutionary transformations/inversions complicating the analysis.
Depends on context: for family data (parents with affected children), use the TDT (transmission disequilibrium test). You can also look for excess IBD among affected sibs (basically a chi-square with Mendelian ratios as the null model). For large panels of "diseased" and "non-diseased", you are basically talking GWAS, and the typical analysis there is a logistic regression of case/control status onto genotype for each marker (which is equivalent to a chi-square or t-test on allele frequency differences, but allows you to account for co-factors such as age, sex, smoking status, etc.), but there are other approaches as well (reviewed partially in http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003258).
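To make the family-based option above a little more concrete, here is a minimal Python sketch of the basic TDT statistic for one biallelic marker. It assumes you have already counted, among heterozygous parents of affected children, how often the allele of interest was transmitted versus not transmitted. The counts and the function name `tdt` are hypothetical, and real analyses would normally use dedicated genetics software.

```python
import math

def tdt(transmitted, untransmitted):
    """Basic transmission disequilibrium test for one biallelic marker.

    transmitted   : times heterozygous parents passed the test allele to an affected child
    untransmitted : times heterozygous parents passed the other allele instead
    Returns (statistic, p_value); the statistic is a McNemar-type chi-squared with 1 df.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative (heterozygous) parent")
    # Under the null hypothesis, both alleles are transmitted equally often.
    stat = (transmitted - untransmitted) ** 2 / total
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts from heterozygous parents of affected children.
stat, p_value = tdt(transmitted=60, untransmitted=40)
print(f"TDT chi2 = {stat:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents are informative here, which is why homozygous parents do not appear in the counts.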
There is an extensive literature on all of the above forms of analysis, and software available for most/all of it.
It is a very general question. You probably want to narrow down your research question.
Ask yourself the following questions:
1. Where did the data come from? E.g. blood, fresh tissue, formalin-fixed tissue, etc.?
2. What technique did you use, or are you planning to use, to measure gene expression? E.g. if you are using microarrays, you will end up using a different statistical method; if you are using RNA-Seq, you will end up fitting a negative binomial model within a generalized linear model framework.
3. What is your research question? E.g. are you interested in up-regulated/down-regulated genes, in differential exon usage, or in transcript levels? If so, for which gene? What are your cases and controls?
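As a small illustration of why point 2 mentions the negative binomial for RNA-Seq: read counts across replicates are usually more variable than a Poisson model allows. The Python sketch below, using invented counts and a hypothetical function name `nb_dispersion_mom`, estimates the dispersion by the method of moments under the common parameterization variance = mean + dispersion * mean^2. It is only a toy; dedicated differential-expression tools estimate dispersion far more carefully.

```python
import statistics

def nb_dispersion_mom(counts):
    """Method-of-moments dispersion estimate for replicate read counts.

    Assumes the negative binomial parameterization var = mu + alpha * mu**2.
    Returns (mean, variance, alpha); alpha is clipped at 0, which corresponds
    to Poisson-like variation (variance no larger than the mean).
    """
    mu = statistics.mean(counts)
    var = statistics.variance(counts)  # unbiased sample variance
    if mu == 0:
        raise ValueError("mean count is zero; dispersion is undefined")
    alpha = max((var - mu) / mu ** 2, 0.0)
    return mu, var, alpha

# Hypothetical read counts for one gene across four replicates.
mu, var, alpha = nb_dispersion_mom([10, 20, 30, 40])
print(f"mean = {mu}, variance = {var:.2f}, dispersion = {alpha:.3f}")
```

A dispersion clearly above zero is the overdispersion that motivates a negative binomial model rather than a Poisson one.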
Doing statistics with little or no knowledge of statistics, or with very little knowledge of your own data, is a very dangerous thing to do.
Venkatesh and Mohsin: I do not disagree with what you are saying, but I was interpreting the question generally. You both are contextualizing the question in terms of, for example, gene expression, microarrays, or other techniques. But I think you would both agree that we are not yet in a position to compare raw (randomly chosen) sequences from cases and controls. Even if we KNOW that a certain genomic region, or even a certain gene, is involved, defining what in the sequence is "wrong" by simple comparison (except, perhaps, if the changes are in exons) may not (I would say "usually does not") work. There are plenty of linkage signals that have been found over the years that have not led to identification of a mutation or other consistent identified change. I maintain this is because we simply do not have enough knowledge (yet) to know how the genome works or to know what kinds of non-coding changes, or what variety of changes, we are looking for.
Single nucleotide polymorphisms (SNPs) are linked to disease susceptibility. It depends on whether your sequences reveal SNPs (synonymous or nonsynonymous, the latter being nonsense or missense). The identified SNPs may or may not show a clinical association. There are limited disease association studies. I agree with David's comment that we do not have enough knowledge (yet) to know how the genome works.
I do agree with you, but in the process of exploring a gene, or any biological element for that matter, we always end up getting nonsensical results that never agree with our hypothesis.
There have been hundreds of times when a dogma in science was broken, e.g. the discovery of cDNA. We have always had, and always will have, very little knowledge in science. My view is that what matters is how we use that 0.000000001% of our knowledge to understand science better and take this world of research to the next level.