Hello everyone,
The RNA-seq data from TCGA/COSMIC contains z-scores instead of RPKM values. I want to perform a comparative analysis of TCGA data and GTEx data, but the problem is that GTEx data contains RPKM and count values. There is a tutorial explaining the z-score calculation (http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/gene_expression_comparison.html), but I am not sure whether it works for RNA-seq or whether it is the right tutorial.
Can someone please guide me on how to calculate gene z-scores from GTEx RNA-seq data (RPKM and count data)? If possible, please also check whether the code at the above link is correct.
From reading the literature and comments, my understanding of the z-score calculation is:
1. Convert the count/RPKM values of each gene to log values.
2. Calculate the mean and standard deviation of the log values of gene X across the 20 lung tissue samples (suppose I have data for 20 samples).
3. For the first lung tissue sample: z-score = (log value of gene X in that sample - mean of the log values across the 20 lung samples) / standard deviation of the log values across the 20 lung samples.
4. Now I have the z-score for gene X in the first lung tissue sample. Using the same protocol, I can convert the log values of all genes into z-scores.
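To make the steps concrete, here is a minimal sketch of the protocol above in Python with NumPy. The function name `log_zscores`, the use of log2, the pseudocount of 1 (to avoid taking the log of zero), and the sample standard deviation (`ddof=1`) are my own assumptions; they are not taken from the tutorial or from COSMIC.

```python
import numpy as np

def log_zscores(expr, pseudocount=1.0):
    """Per-gene z-scores of log-transformed expression.

    expr: array-like of shape (genes, samples) holding RPKM or
          normalized count values (hypothetical input layout).
    pseudocount: added before the log so zeros are defined (assumption).
    """
    expr = np.asarray(expr, dtype=float)
    # Step 1: log-transform every value.
    logged = np.log2(expr + pseudocount)
    # Step 2: mean and standard deviation of each gene across samples.
    mean = logged.mean(axis=1, keepdims=True)
    sd = logged.std(axis=1, ddof=1, keepdims=True)
    # Genes with no variation have no meaningful z-score; mark them NaN.
    sd[sd == 0] = np.nan
    # Steps 3 and 4: (log value - gene mean) / gene standard deviation.
    return (logged - mean) / sd

# Toy example: 2 genes x 3 samples (made-up numbers, not real data).
toy = [[1, 3, 7],
       [5, 5, 5]]
print(log_zscores(toy))
```

Each row of the result is one gene and each column one sample, so the value for gene X in the first lung sample would be the first entry of that gene's row.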
My question is whether the above protocol is correct. Please advise.
Should I calculate the z-score using read counts or RPKM values?
Do these z-scores really have meaning? The z-scores COSMIC provides look like this:
ID_SAMPLE  SAMPLE_NAME      GENE_NAME  REGULATION  Z_SCORE  ID_STUDY
1337808    TCGA-02-2483-01  SFMBT1     over        2.416    329
1337808    TCGA-02-2483-01  SGCE       normal      -0.274   329
If I calculate the z-score using the above approach, will I be able to determine whether a gene is over-regulated or normally regulated?
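For illustration, this is roughly how I imagine the REGULATION column could be derived from a z-score. The cutoff of 2 is only my assumption: it is consistent with the two rows shown (2.416 is "over", -0.274 is "normal"), but I have not confirmed the threshold COSMIC actually uses. The "under" label and the function name `classify_regulation` are also my own guesses.

```python
def classify_regulation(z, threshold=2.0):
    """Label a z-score as over, under, or normal.

    threshold: assumed cutoff (not confirmed against COSMIC).
    """
    if z != z:  # NaN check: z-score was undefined for this gene
        return "undefined"
    if z > threshold:
        return "over"
    if z < -threshold:
        return "under"
    return "normal"

# The two z-scores from the COSMIC rows shown earlier.
for gene, z in [("SFMBT1", 2.416), ("SGCE", -0.274)]:
    print(gene, classify_regulation(z))
```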
Please advise on how to proceed.
Thanks