Hello everyone,

The RNA-seq data from TCGA/COSMIC is provided as z-scores rather than RPKM values. I want to perform a comparative analysis of TCGA data and GTEx data, but the problem is that GTEx data contains RPKM values and read counts. There is a tutorial explaining a z-score calculation (http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/gene_expression_comparison.html), but I am not sure whether it works for RNA-seq data or whether it is the right tutorial to follow.

Can someone please guide me on how to calculate per-gene z-scores from GTEx RNA-seq data (RPKM and count data)? If possible, please also check whether the code in the above link is correct.

From reading the literature and comments, my understanding of the z-score calculation is:

1. Convert the count/RPKM values of each gene to log values.

2. Calculate the mean and standard deviation of the log values of gene X across 20 lung tissue samples (suppose I have data for 20 samples).

3. For the first lung tissue sample: z-score = (log value of gene X in that sample - mean of the log values across the 20 lung samples) / standard deviation of the log values across the 20 lung samples.

4. This gives the z-score for gene X in the first lung tissue sample. Repeating the same steps, I can convert the log values of all genes in all samples into z-scores.
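
To make the steps above concrete, here is a minimal sketch in Python with NumPy. It is only my reading of the protocol, not the method COSMIC or the linked tutorial uses. The function name `zscores_per_gene`, the log base (2), the pseudocount of 1 added before taking logs (so that zero values do not give -infinity), and the toy matrix are all my own assumptions.

```python
import numpy as np

def zscores_per_gene(expr, pseudocount=1.0):
    """Per-gene z-scores of log2 expression.

    expr: 2-D array, rows = genes, columns = samples (e.g. RPKM values).
    pseudocount: assumed offset added before the log to avoid log(0).
    """
    # Step 1: log-transform the expression values.
    log_expr = np.log2(np.asarray(expr, dtype=float) + pseudocount)

    # Step 2: mean and sample standard deviation of each gene across samples.
    mean = log_expr.mean(axis=1, keepdims=True)
    sd = log_expr.std(axis=1, ddof=1, keepdims=True)

    # Genes that are (numerically) constant have no meaningful z-score,
    # so their standard deviation is marked NaN and the z-scores become NaN.
    sd[sd < 1e-12] = np.nan

    # Steps 3 and 4: standardize every gene in every sample.
    return (log_expr - mean) / sd

# Hypothetical toy data: 2 genes x 4 samples of RPKM values.
toy_rpkm = np.array([
    [1.0, 3.0, 7.0, 15.0],   # gene that varies across samples
    [3.0, 3.0, 3.0, 3.0],    # constant gene, z-scores come out as NaN
])
print(zscores_per_gene(toy_rpkm))
```

Each row of the result has mean 0 and standard deviation 1 across the samples used as the reference set, so the z-score of a gene only says how far one sample lies from the other samples in that same set.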

My question is whether the above protocol is correct. Please advise.

Should I calculate the z-score from read counts or from RPKM values?

Do these z-scores really have meaning? The z-scores that COSMIC provides look like this:

ID_SAMPLE  SAMPLE_NAME      GENE_NAME  REGULATION  Z_SCORE  ID_STUDY
1337808    TCGA-02-2483-01  SFMBT1     over          2.416  329
1337808    TCGA-02-2483-01  SGCE       normal       -0.274  329

If I calculate z-scores using the above approach, will I be able to tell whether a gene is over-regulated or normally regulated in a sample, as in the REGULATION column above?
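
For illustration, the labelling step could look like the sketch below if a fixed cutoff is applied to the z-score. The cutoff of 2, the "under" label, and the function name `classify_regulation` are my assumptions. The two rows above only show that 2.416 is labelled "over" and -0.274 is labelled "normal", which is consistent with such a cutoff but does not confirm it.

```python
import math

def classify_regulation(z_score, threshold=2.0):
    """Label a z-score as 'over', 'under' or 'normal'.

    threshold: assumed cutoff; not confirmed by the COSMIC rows shown above.
    Returns None when the z-score is undefined (NaN).
    """
    if math.isnan(z_score):
        return None
    if z_score > threshold:
        return "over"
    if z_score < -threshold:
        return "under"  # assumed label, not present in the example rows
    return "normal"

# The two z-scores from the COSMIC rows quoted above.
for gene, z in [("SFMBT1", 2.416), ("SGCE", -0.274)]:
    print(gene, z, classify_regulation(z))
```

With the assumed cutoff, the two example rows get the same labels as in the REGULATION column, but a label computed from GTEx samples is relative to the GTEx samples used as the reference set.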

Please advise on how to proceed.

Thanks
