For example, ERBB2 has two probe IDs, i.e. 216836_s_at and 210930_s_at, and the two have different expression values. What should I do with the two values? Should I take the mean of the two?
If the expression values for two probes to the same gene are strongly correlated, then it may be safe to summarise them.
If the probes are not correlated with each other in your experiment, then you may have detected differential splicing in your experiment between two alternative isoforms. In this case it is better to either treat each probe as a separate transcript, and consider them separately, or to choose the probe you like best to represent each gene.
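To make the correlation check concrete, here is a minimal sketch in plain Python. The intensity values are made up for illustration, and the 0.8 cutoff is an arbitrary assumption, not an established threshold.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical log2 intensities for two probe sets of one gene across six samples.
probe_a = [8.1, 8.4, 11.9, 12.3, 8.0, 12.1]
probe_b = [6.0, 6.2, 9.5, 10.1, 5.9, 9.8]

R_THRESHOLD = 0.8  # arbitrary cutoff, an assumption for this sketch

r = pearson(probe_a, probe_b)
if r >= R_THRESHOLD:
    # The probes track each other, so one value per sample is defensible.
    summary = [(a + b) / 2 for a, b in zip(probe_a, probe_b)]
    decision = "summarise"
else:
    # The probes disagree, so keep them as separate transcripts.
    summary = None
    decision = "keep separate"

print(round(r, 3), decision)
```

With real data you would run this per gene on the normalized probe-level values and inspect the genes that fall below the cutoff rather than discarding them.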
For choosing the probe you like best, one approach is to choose a probe which matches an exon which is in most/all of the splice variants of the gene. Another approach is to choose the probe with the strongest raw expression, or the lowest within-conditions variance.
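The two selection criteria mentioned above (strongest expression, lowest within-conditions variance) can give different answers. A small sketch with invented values, where the probe names and condition labels are hypothetical:

```python
from statistics import mean, variance

# Hypothetical log2 intensities: probe -> condition -> replicate values.
data = {
    "probe_1": {"control": [7.0, 7.2, 6.8], "treated": [9.0, 9.1, 8.9]},
    "probe_2": {"control": [10.0, 11.0, 9.0], "treated": [12.0, 13.5, 11.0]},
}

def mean_expression(by_condition):
    """Mean intensity over all replicates of all conditions."""
    values = [v for reps in by_condition.values() for v in reps]
    return mean(values)

def within_condition_variance(by_condition):
    """Average of the per-condition sample variances."""
    return mean(variance(reps) for reps in by_condition.values())

# Criterion 1: the probe with the strongest expression.
strongest = max(data, key=lambda p: mean_expression(data[p]))
# Criterion 2: the probe with the lowest within-conditions variance.
steadiest = min(data, key=lambda p: within_condition_variance(data[p]))

print(strongest, steadiest)
```

In this toy data the two criteria pick different probes, which is why it helps to decide on one criterion up front and apply it to every gene.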
Finally, when you come to qPCR validation of your array experiment, don't forget to design qPCR probes to match the same exon as the array probes you chose; otherwise, the correlation between expression in array and qPCR will be reduced.
Arrays do contain multiple probes for the same gene, but these will have been designed to interrogate different transcript fragments. There are also probes that are not specific to a single transcript and thus may interrogate multiple genes (so-called promiscuous probes).
This is precisely why one typically analyses data as probe level data for as long as possible, before finally summarizing analyzed results as named genes.
In the event that two or more probes passed your analytical threshold(s) for significance, you may wish to retain the data as distinct probes. Or, for some summary presentations, you could average them, say by taking the mean of the log2 fold changes (equivalently, the geometric mean of the fold changes) across all probes for a given named gene.
But you should never summarize probes to genes before all your statistical analyses are complete. Use probe level data for all significance testing, and only summarize by gene ID for final presentation. For example, you may want to, at the end of an array analysis, display a heat map of log2 fold change for the significant genes. That final summary step may require making choices about how to deal with redundant probes, and one way could be to take the mean or median of the probes' log2 fold changes to display a summary log2 fold change for that gene.
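As a rough illustration of "test at probe level, summarise by gene only for presentation", here is a sketch. The result tuples, gene names other than ERBB2, and the 0.05 cutoff are all invented for the example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical probe-level results that have ALREADY been through testing:
# (probe_id, gene, log2_fold_change, adjusted_p)
probe_results = [
    ("216836_s_at", "ERBB2", 2.4, 0.001),
    ("210930_s_at", "ERBB2", 1.8, 0.004),
    ("probe_x1", "GENE_X", 0.3, 0.60),
    ("probe_y1", "GENE_Y", -1.2, 0.010),
    ("probe_y2", "GENE_Y", -0.2, 0.450),
]
ALPHA = 0.05  # assumed significance cutoff

# Keep only significant probes, grouped by gene.
by_gene = defaultdict(list)
for probe_id, gene, log2fc, adj_p in probe_results:
    if adj_p < ALPHA:
        by_gene[gene].append(log2fc)

# Summarise for display only, e.g. as the input to a heat map.
gene_summary = {gene: median(values) for gene, values in by_gene.items()}
print(gene_summary)
```

Note that the summary is computed from the test results and never fed back into the statistics.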
Thank you very much for the detailed reply. So it means that, rather than using the already available processed data, one should first download all the raw data and perform the whole analysis again before summarizing probes to genes.
I have one more question. Let's say I want to find the expression values of one gene in breast cancer. What might be the reasons that the probe identifiers of that gene are not present in the expression files (processed data) of many cancer studies?
This is a huge problem in bioinformatics. Basically, the probes represent a certain locus in the genome. Annotation algorithms assign these loci to genes, but this is done somewhat arbitrarily, with built-in assumptions (e.g. most of the time the probe is assigned to the gene with the nearest 5' promoter). As Michael pointed out, there can be multiple probes annotated to one gene. Conversely, you can have a probe annotated to multiple genes. Your most accurate description of the data is at the probe level. You can infer gene regulation, but know that at the end of the day you are measuring changes in the amount of a particular sequence (~60-mer probe). A difference between two probes that represent one gene can possibly reflect multiple splice forms of the gene.
In the scenario you pointed out, in which two probes annotated to HER2 (ERBB2) have different levels of expression, I would assess the raw values. Determine the detection level of the array. If one of the probes is below that level, you can assume that it is a bad probe. If you have to use one probe, I would use the one with the higher level of detection (i.e. above the detection limit). You can't average the two probes, for two reasons: 1) they can represent different splice forms, and 2) one probe may just not have worked (i.e. you may be misrepresenting the expression of the gene).
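The detection-limit check can be sketched like this. The limit of 6.0, the probe names and the values are all hypothetical, and the rule "detected in at least half of the samples" is an assumption made for the example.

```python
from statistics import mean

# Hypothetical log2 detection limit, e.g. estimated from negative control probes.
DETECTION_LIMIT = 6.0

# Hypothetical raw log2 intensities for two probes of the same gene.
raw = {
    "probe_A": [11.2, 10.8, 11.5, 10.9],
    "probe_B": [5.1, 4.8, 5.6, 5.0],
}

def fraction_detected(values, limit):
    """Fraction of samples in which the probe is above the detection limit."""
    return sum(v > limit for v in values) / len(values)

detected = {p: fraction_detected(v, DETECTION_LIMIT) for p, v in raw.items()}

# Keep probes detected in at least half of the samples (assumed rule),
# then take the one with the highest mean intensity.
usable = [p for p, frac in detected.items() if frac >= 0.5]
chosen = max(usable, key=lambda p: mean(raw[p])) if usable else None

print(detected, chosen)
```

If no probe clears the limit, `chosen` is `None`, which is a reasonable signal that the gene is not measurable on that array.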
If you are comparing array values from different samples and/or experiments, make sure you are comparing the same probes across the board. You can't average the probes and then compare the averages across the samples. This introduces a huge amount of bias. The easiest way is to pick one probe for each gene, and then note the criterion you used for choosing the probe (e.g. the probe was the most variable across all samples, the probe had higher levels of detection in most samples, etc.).
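One way to keep the probe choice consistent is to choose once, record the criterion, and then read the same probe from every sample. A sketch with made-up sample, probe and gene names:

```python
from statistics import pvariance

# Hypothetical data: sample -> probe -> log2 intensity.
samples = {
    "S1": {"p1": 8.0, "p2": 7.9, "p3": 5.0},
    "S2": {"p1": 10.0, "p2": 8.0, "p3": 5.2},
    "S3": {"p1": 12.0, "p2": 8.1, "p3": 5.1},
}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}

def most_variable_probe_per_gene(samples, probe_to_gene):
    """For each gene, pick the probe with the highest variance across samples."""
    best = {}
    for probe, gene in probe_to_gene.items():
        var = pvariance([values[probe] for values in samples.values()])
        if gene not in best or var > best[gene][1]:
            best[gene] = (probe, var)
    return {gene: probe for gene, (probe, _) in best.items()}

# Choose once, using all samples, and write down how the choice was made.
chosen = most_variable_probe_per_gene(samples, probe_to_gene)
selection_note = "criterion: highest variance across all samples"

# Read the SAME probe for every sample; no per-sample averaging.
gene_table = {
    name: {gene: values[probe] for gene, probe in chosen.items()}
    for name, values in samples.items()
}
print(chosen, selection_note)
```

The variance criterion here is just one of the options listed above; the point of the sketch is that the mapping is fixed before any sample is compared.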
For your final question, the answer is complicated. There is very limited overlap between array platforms. Information is also lost due to cross-platform probe discordance (Marshall et al., Science 2004). That is, an Agilent probe is not necessarily represented on different Agilent platforms, let alone on other platforms like Affy. A quick glance at some of the bigger meta-analyses will reveal that out of 20K possible probes only about 2K end up being used for comparison. When you are assessing gene expression across platforms, you need to either do it at the gene level or pick probes which are closest in location to each other. Reduce the data to probe level, then compare the different probes. Reporting the variation between probes is actually valuable information and may be informative for your study.
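To see how little two platforms may share, you can intersect their annotations at the gene level. This sketch uses tiny invented annotation tables; the probe IDs and the "GENE_ONLY_*" names are placeholders, not real platform content.

```python
# Hypothetical probe-to-gene annotations for two platforms.
platform_a = {
    "a_001": "ERBB2",
    "a_002": "ERBB2",
    "a_003": "TP53",
    "a_004": "GENE_ONLY_A",
}
platform_b = {
    "b_101": "ERBB2",
    "b_102": "TP53",
    "b_103": "GENE_ONLY_B",
}

def probes_by_gene(annotation):
    """Invert a probe -> gene mapping into gene -> list of probes."""
    grouped = {}
    for probe, gene in annotation.items():
        grouped.setdefault(gene, []).append(probe)
    return grouped

genes_a = probes_by_gene(platform_a)
genes_b = probes_by_gene(platform_b)

# Genes that can be compared at all, with the candidate probes on each side.
shared_genes = sorted(set(genes_a) & set(genes_b))
comparison = {g: (genes_a[g], genes_b[g]) for g in shared_genes}
print(comparison)
```

Choosing which probe pair to compare within each shared gene (for example the two closest in genomic location) still needs probe coordinates, which this sketch does not model.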
In general, not necessarily in your case with ERBB2, you might also want to look in detail at what the probes of your probe set are targeting.
If you have several probe sets for one gene and only one is significant in your analysis, it could be that they target different transcript isoforms (good for your research), or that the probe set is not very specific. You should check whether some of the probes in the probe set might be binding other transcripts.
GeneAnnot from the Weizmann Institute might be useful for your array platform.
This may be my last question. Kindly shed some light on this as well, and correct me where I am wrong.
Why do some probe sets not contain information about some genes? For example, for the Affy HG U133A array I cannot find a probeset ID for the PPP1R1B gene using ENSEMBL. However, on Affy HGU133 PLUS there is a probeset ID for PPP1R1B. Why is this so?
I also downloaded some publicly available raw Affy HG U133A data. After preprocessing and normalization using MAS5, I did not find any expression values or any probeset ID representing the PPP1R1B gene.
What are the possible reasons? Why is there no information for this gene? Any comments will be greatly appreciated.
Microarray results need to be interpreted very conservatively and can be used only for building a hypothesis, which then requires a successful validation step.
Microarray technology was ultrapopular a few years ago. The advantage of this technology is its screening capability. The disadvantage is its precision. We are assuming that a given sequence (probe) is capable of recognizing the expression of the transcript of interest. This assumption is based on the knowledge of the genome available at the time the probe was designed. In the specific case of the Affy chip, such a sequence was designed many years ago, when knowledge of the genome was not complete (and it is not complete even today!). It could be (I did not check it) that the original probe designed to target PPP1R1B is not specific for this gene and can recognize (also or only) something else. Moreover, if a specific probe overlaps a SNP, this may affect the results, and you could get no expression or lower expression for this reason alone. This should be taken into account particularly if you are comparing expression across cells coming from different individuals.
In conclusion, GEO datasets may be used to build hypotheses, but such hypotheses need to be validated by other means. In the era of deep sequencing, I would recommend validating your results with qPCR approaches whose primer design is solidly supported by Sanger sequencing, or directly through transcriptome analysis with deep sequencing.
Hi Farhan, you can try InSilico DB (https://insilicodb.com): search for public datasets of interest, click the analyze button, then select the "gene" option and the analysis tool of your choice (I recommend GENE-E for differential expression analysis and heatmap visualization, and GenePattern for more advanced analysis modules).
InSilico DB collapses probes to genes using the maximum value (if two probes match one gene, the highest value is used). CEL files are normalized using fRMA.
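For readers who want to see what "collapse by maximum" means in practice, here is a minimal sketch. It is not InSilico DB's code, just an illustration of the rule described above, and the values are invented.

```python
# Hypothetical normalized values for one sample: probe -> (gene, value).
sample = {
    "216836_s_at": ("ERBB2", 11.4),
    "210930_s_at": ("ERBB2", 9.7),
    "probe_z": ("GENE_Z", 6.3),
}

def collapse_max(sample):
    """Collapse probes to genes, keeping the highest value per gene."""
    collapsed = {}
    for gene, value in sample.values():
        if gene not in collapsed or value > collapsed[gene]:
            collapsed[gene] = value
    return collapsed

gene_values = collapse_max(sample)
print(gene_values)
```

Because the maximum is taken per sample, the probe that "wins" can differ from one sample to the next, which is worth keeping in mind given the earlier advice to compare the same probe across samples.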
Here are two step-by-step tutorials to visualize a differential expression heatmap in GENE-E, and doing a Gene Set Enrichment Analysis with GenePattern.
Farhan, un-annotated probes in a given array schema are common, but that is not the issue in your example. Affy does update their annotation regularly though, so you always need to be sure you have the most up-to-date information. The ultimate source for annotation is Affymetrix themselves (or Agilent for their arrays). You can sign up at Affy's web site for a free account and then search their NETAFFX database pages for the most current annotation they have for their probesets.
In the example you give above, you were simply searching out-of-date annotation for the U133A array set. Both of those arrays have the exact same probe for that gene, 225165_at, according to Affy's online annotation. However, the older U133A array set may not have originally included any annotation for that probe set (if it was not known or confirmed at the time the array was released), and unless you ensure you are using up-to-date annotation for it, you will miss it.
A lot of software will automatically check for updated annotation when you analyze data (e.g. Partek, JMP Genomics, even Bioconductor for the array annotation databases directly supported by Affy). But whenever you are re-analyzing older data, you always need to ensure you have up-to-date annotation information for that array's probe sets.
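The effect of stale annotation is easy to reproduce with two small lookup tables. The tables below are hand-written toys, not real NetAffx exports; the 225165_at to PPP1R1B pairing is simply the one quoted in the reply above.

```python
# Toy annotation tables: probe set ID -> gene symbol (None = not annotated).
old_annotation = {"225165_at": None, "216836_s_at": "ERBB2"}
new_annotation = {"225165_at": "PPP1R1B", "216836_s_at": "ERBB2"}

def probes_for_gene(annotation, symbol):
    """Return the probe set IDs annotated to a gene symbol."""
    return sorted(p for p, gene in annotation.items() if gene == symbol)

def newly_annotated(old, new):
    """Probe sets that have a gene symbol in the new table but not in the old one."""
    return {
        probe: gene
        for probe, gene in new.items()
        if gene is not None and old.get(probe) is None
    }

print(probes_for_gene(old_annotation, "PPP1R1B"))  # search with stale annotation
print(probes_for_gene(new_annotation, "PPP1R1B"))  # search with current annotation
print(newly_annotated(old_annotation, new_annotation))
```

The probe set is present in the data either way; only the lookup by gene symbol fails when the annotation is out of date.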
Also, I would not use MAS5. Even Affy has abandoned it, as it is known to be a poor normalization scheme. Affy now uses PLIER, but honestly, even that performs poorly relative to RMA or GCRMA.
"Why do some probe sets not contain information about some genes? For example, for the Affy HG U133A array I cannot find a probeset ID for the PPP1R1B gene using ENSEMBL. However, on Affy HGU133 PLUS 2 there is a probeset ID for PPP1R1B. Why is this so?"
HGU133 PLUS 2 (platform ID GPL570 in GEO) is newer and has roughly twice as many probe sets as U133A (platform ID GPL96 in GEO) or U133B (platform ID GPL97 in GEO): ~20,000 each for U133A and U133B versus ~50,000 for HGU133 PLUS 2. The latter is a newer, augmented chip.
After normalizing the arrays (e.g. using RMA with background correction and quantile normalization), we obtain normalized probe values. Then we calculate gene expression values from the probe values: if one gene has multiple probes, we use the median of the probe values for that gene. Finally, we get gene expression values per sample. Should we apply between-array normalization (e.g. quantile normalization) again if the arrays (with gene expression values) seem non-normalized?
Taking an average hybridization intensity will indeed introduce bias. I have just done a transcriptome study, and where there were redundant probe sets I selected the probe set with the highest hybridization intensities across the samples to represent each gene. I chose this method because other people already use it. The reason you have to lose the rest is that, if you are analyzing for fold enrichment, multiple probe sets will introduce bias.
The reason you have probe set IDs which do not match genes depends on the array chip you are using. For example, HTA 2.0 probe sets are designed to cover the entire human genome sequence, so some probe sets will hybridize to sequences that do not code for a gene, and there won't be any gene annotation for them.
Hi everyone, I am working on a project based on the analysis of gene expression. Though I am only required to do the web development part of it, I would like to learn as much as possible about gene expression in a short span of time. I am downloading CEL files from GEO databases, which will be fed to an ML pipeline for analysis. But even verifying that I am fetching the correct data becomes confusing when I check those CEL files manually.
I am using this thread to ask you researchers to suggest resources for getting the basics of this field right quickly.