I am working on RNA seq data. I want to reduce the dimensions and select the top genes? Can anyone suggest what statistics can be applied?

Genes cooperate in gene expression networks, thus the main point is that the 'actual dimensionality' of the system is drastically reduced with respect to the number of genes. The classical way (applicable in any situation with a very high number of concurring elemnts) is like this:

1. Consider samples as variables and genes as statistical units, then compute PCA on the resulting matrix, if the samples refer to the same tissue you will come out with a by far main first principal component in order of percent of explained variance (it is the so called 'batch effect', correspondent to the fact each tissue or celular kind has an 'ideal profile' of gene eexpression levels). Consider the subsequent components (each other linear independent by construction and thus pointing to indpendent biological latent concepts), until you reach the 'noise floor' (only noise, i.e. very minor part of variance explained by minor components). let's imagine you end up with 3 principal components.

2. Each component has associated its loading profile. Loadings are the correlation coefficients between variables (samples in your case) and components. Higher loadings (in absoluye value) mark the leading variables for component interpretation. If you have, let's imagine, cancer and normal samples, the component in which cancer and normal samples have opposite sign loadings is 'cancer effect'. At this point look at the score of genes on different components, if you extracted the principal components from between samples correlation (not covariance) matrix your component scores will have the form of z-scores (zero mean and unit standard deviation), select those gens having a score > 2 (or < -2) and you will get what you want.

Here some (out of the many) examples:

Article Mining gene expression data by interpreting principal components

https://www.nature.com/articles/s41598-020-62804-3

Article Cell Fate Decision as High-Dimensional Critical State Transition

Article A Systems Biology Strategy on Differential Gene Expression D...

Terezinha Souza

First, you could filter for low-expressed genes (e.g., set a cutoff of average 10 reads across samples, remove genes with zero counts etc). Then, you could select the top n most variable genes across samples.

If your experimental design includes distinctive groups (e.g., treated vs control) you could estimate differential expression with packages like DESeq2 or EdgeR in R.

Osman Radwan

If your intention is to select the top genes before applying statical analysis, you need to remove genes with 0 counts and those with low counts across samples. After applying edgeR for example, you may sort genes by FDR value and Log2FC

• What the possible Persistent Organic Pollutants and Heavy metals present in fluorspar, sediments, and water bodies around its mining area?

Which type of compound does lamda max of 218 indicate in a uv-vis spectrum of a partially purified compound through column and TLC?

Is there a problem with my RNA pellet?

Can you connect an HPLC to a Mass Spec only at a certain time point?

Handling Missing Data and Building a Predictive Model with Incomplete Information ?

Strugglling with m6A dot blot any suugesstion ?

Do interactions between biosphere, carbon cycle, & water cycle impact global warming & interaction between atmosphere & hydrosphere?

RNA Extraction Using Hot Borate Method No Longer Working?

Is it true that $\det(V(A))$ may be only $\pm 1$, depending on $n$, for the last symmetric tridiagonal matrix $A$?

Does Anyone have expertise in in vitro transcription and RNA pull down assay?