Can you compare reduced representation bisulfite sequencing (RRBS) data with Illumina 450K methylation arrays? Thinking about doing some Bioinformatics on Epigenetics data. Cheers.
Yes, but you first need to identify the CpG sites covered by both assays. Then you could, for example, calculate the correlation between the average methylation measured by RRBS and by the 450K array at those shared sites.
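A minimal sketch of that comparison, in case it helps. It assumes you have already parsed RRBS calls into a dict of (chromosome, position) -> (methylated reads, total reads) and the 450K beta values into a dict keyed by the same probe coordinates. The function names, the coverage cutoff of 10 reads, and the toy numbers are all made up for illustration and do not come from any particular package.

```python
from math import sqrt

def shared_sites(rrbs, array, min_cov=10):
    """Pair RRBS methylation fractions with array betas at shared CpGs.

    rrbs:  {(chrom, pos): (methylated_reads, total_reads)}
    array: {(chrom, pos): beta}
    Sites with fewer than min_cov RRBS reads are skipped (assumed cutoff).
    """
    pairs = {}
    for site, (meth, total) in rrbs.items():
        if total >= min_cov and site in array:
            pairs[site] = (meth / total, array[site])
    return pairs

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)

# Toy data, invented for illustration only.
rrbs = {
    ("chr1", 1000): (18, 20),
    ("chr1", 2000): (1, 25),
    ("chr2", 500): (10, 20),
    ("chr2", 900): (3, 4),    # too few reads, filtered out
    ("chr3", 42): (30, 30),   # no probe on the array
}
array = {
    ("chr1", 1000): 0.85,
    ("chr1", 2000): 0.08,
    ("chr2", 500): 0.55,
    ("chr2", 900): 0.70,
    ("chrX", 7): 0.50,        # not covered by RRBS
}

pairs = shared_sites(rrbs, array)
rrbs_vals = [p[0] for p in pairs.values()]
array_vals = [p[1] for p in pairs.values()]
r = pearson(rrbs_vals, array_vals)
print(len(pairs), "shared sites, r =", round(r, 3))
```

With real data you would do this per sample (or per matched sample pair) and also look at how many sites survive the overlap, since that number tells you how much of each platform you are actually comparing.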
Ok, that's helpful to test whether they're comparable. What I want to do is run a meta-analysis across datasets generated by both methods. Would a cohort normalisation, like ComBat, do the trick? I usually work with expression data, does methylation have similar 'housekeeper' genes we would expect to be similar across all samples? With expression I have also run density plots for each sample to test whether they have similar distributions, although I understand this may not apply since beta values are not normally distributed.
"What I want to do is run a meta-analysis across datasets generated by both methods. Would a cohort normalisation, like ComBat, do the trick?"
- I am quite sure there is no software (yet) that will "normalize" methylation data from the different platforms.
----------------------
"I usually work with expression data, does methylation have similar 'housekeeper' genes we would expect to be similar across all samples?"
- Yes, there are gene promoters and CpG islands with known methylation status. Most of them are unmethylated in the "normal" state, but some are fully methylated. There are also imprinted genes, where you would expect to find one allele methylated and one allele unmethylated, so ~50% methylation.
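As a rough illustration of how such regions could serve as a sanity check: the sketch below averages the beta values over a few control regions per sample and flags any region whose mean is far from its expected level (about 0 for unmethylated, 1 for fully methylated, 0.5 for imprinted). The region names, site IDs, beta values and the tolerance of 0.2 are all hypothetical placeholders; you would substitute regions whose status is actually known for your tissue.

```python
from statistics import mean

# Expected mean methylation for each kind of control region.
EXPECTED = {"unmethylated": 0.0, "methylated": 1.0, "imprinted": 0.5}

def check_controls(betas, controls, tol=0.2):
    """Compare mean beta of each control region with its expected level.

    betas:    {site_id: beta} for one sample
    controls: {region_name: (kind, [site_ids])}
    Returns {region_name: (mean_beta or None, within_tolerance)}.
    """
    report = {}
    for name, (kind, sites) in controls.items():
        values = [betas[s] for s in sites if s in betas]
        if not values:
            # Region not covered in this sample: cannot be checked.
            report[name] = (None, False)
            continue
        m = mean(values)
        report[name] = (m, abs(m - EXPECTED[kind]) <= tol)
    return report

# Hypothetical control regions and one invented sample.
controls = {
    "region_A": ("unmethylated", ["site_a1", "site_a2"]),
    "region_B": ("methylated", ["site_b1", "site_b2"]),
    "region_C": ("imprinted", ["site_c1", "site_c2"]),
}
sample = {
    "site_a1": 0.03, "site_a2": 0.07,
    "site_b1": 0.95, "site_b2": 0.91,
    "site_c1": 0.48, "site_c2": 0.95,  # region_C drifts away from 0.5
}

report = check_controls(sample, controls)
for name, (m, ok) in report.items():
    print(name, m, "ok" if ok else "CHECK")
```

Running the same check on the RRBS and the 450K samples separately gives a quick impression of whether both platforms agree on regions where the answer is known in advance.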
----------------------
Someone in my lab is trying to do pretty much exactly what you posted and is running into quite a few problems. There are many things to look out for. I'll start with what pops into my mind first, but if you like I'll interview him over a cup of coffee and make some notes. Let me know :-)
1. You can only investigate CpGs that are covered by both assays. RRBS is limited because the restriction enzyme (typically MspI, which cuts at CCGG sites) does not capture all CpGs; 450K is limited because only the CpGs covered by a probe are analyzed.
2. "run a meta-analysis across datasets generated by both methods" sounds like you want to take the ready-annotated output lists. In our case, the gene annotation is different on both platforms, making it difficult to compare datasets. Unless you start the analysis from scratch with raw data (which is usually not done in a meta-analysis, I think), you may well have to standardize the gene names or be prepared to leave ambiguous ones out.
3. Gene names get even more complicated when you consider that some genes have a multitude of isoforms. Unfortunately, for some genes, the annotation of isoforms is incomplete, incorrect, or out-of-date, making standardization of names even more difficult.
In short, from what I have heard and seen: the further back you go to the raw data, the easier it will be to compare the two datasets. If you plan on using annotated output lists, you will have to accept some (quite substantial) losses.
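To make the name standardization from points 2 and 3 a bit more concrete, here is a minimal sketch. It assumes you have an alias table mapping each official symbol to its known aliases (exported from whatever annotation source you trust); the gene names below are placeholders, not real symbols, and the helper names are my own. Aliases that point to more than one official symbol are treated as ambiguous and dropped, which is exactly where the losses come from.

```python
def build_alias_map(alias_table):
    """Map every known name (upper-cased) to its official symbol.

    alias_table: {official_symbol: [aliases]}
    Names claimed by more than one official symbol are mapped to None,
    i.e. marked as ambiguous.
    """
    amap = {}
    for official, aliases in alias_table.items():
        for name in [official, *aliases]:
            key = name.upper()
            if key in amap and amap[key] != official:
                amap[key] = None  # conflicting targets: ambiguous
            else:
                amap[key] = official
    return amap

def harmonise(names, amap):
    """Translate a gene list to official symbols; collect what is lost."""
    kept, dropped = set(), []
    for name in names:
        official = amap.get(name.strip().upper())
        if official is None:
            dropped.append(name)  # unknown or ambiguous name
        else:
            kept.add(official)
    return kept, dropped

# Placeholder alias table and gene lists, invented for illustration.
alias_table = {
    "GENEA": ["GA1", "OLDA"],
    "GENEB": ["GB1"],
    "GENEC": ["GA1"],  # GA1 is also claimed by GENEA, so it is ambiguous
}
rrbs_genes = ["GeneA", "GB1", "GA1", "GENED"]
array_genes = ["OLDA", "GENEB", "GENEC"]

amap = build_alias_map(alias_table)
rrbs_kept, rrbs_dropped = harmonise(rrbs_genes, amap)
array_kept, array_dropped = harmonise(array_genes, amap)
common = rrbs_kept & array_kept

print("comparable genes:", sorted(common))
print("lost from RRBS list:", rrbs_dropped)
print("lost from array list:", array_dropped)
```

Keeping the dropped lists around is useful: they show how substantial the losses are and whether well-known genes are among them.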
Before I continue, please let me know if this is going in the right direction. I'll be glad to post more or get in touch directly if needed. And: lots of people are doing stuff like this at the moment. Be sure to check pubmed regularly for new ideas and tools to analyze methylation data.
Good luck!
Nick
PS: Was that photo taken at Cologne main station? It seems so familiar.... :-)
Thanks Nick, this is really helpful. I've encountered some of these problems with different gene expression microarray platforms before. I think it's in-house data rather than public, so I should be able to get raw data for each platform.
My collaborators who've generated the data have far more expertise on Epigenetics than I do but they're lacking in bioinformatics experience. I think they know there's a need to develop a method to do this but they don't understand how difficult it will be. I suppose it depends how concerned they are about losing genes that don't match up between the two. I'm open to using existing tools, including those of your student, if possible.
Well spotted, I traveled to Cologne last year. I'm keen to travel again but it's a long way from NZ.
Glad I could help, and good to hear you have the option of accessing the raw data!
Our lab (including me) also specializes in Epigenetics. I am one of the few in our group with a certain affinity to bioinformatics, so I know how difficult (or near to impossible) it can be to explain the difficulties in certain data analyses to colleagues. I am glad though that my main point (i.e. that this is not going to be an easy task) came across.
As for existing tools:
As you have access to raw data, "RnBeads" is definitely worth looking into: http://rnbeads.mpi-inf.mpg.de/
It should take raw data from both platforms. However, you will have to check the overlap of CpGs covered in both assays (as mentioned in the previous posts), as they have completely different approaches.
Our lab decided to accept the losses and go with what remains (i.e. we loaded annotated lists into Excel). This approach has the following caveat: The genes that are best investigated are those that are best annotated. This includes genes with many isoforms and a long history of name adaptation. Therefore, leaving out genes with such discrepancies will just eliminate well-known genes from your lists... "there may be more optimal solutions" is probably the most diplomatic way to describe this ;-)
I would be delighted if you could keep me updated on your progress (of the analysis pipeline, not the details of your analysis, of course!) - maybe my colleague could also profit from some of your insights.
Good luck and just post here or message me if you have any further questions :-)
Thanks, I'll check it out. The core focus of my research is still gene expression so this may take a while. I'm certainly interested in getting involved in Epigenetics. My collaborators have contacted my supervisor who has even less available time or confidence that it could work. The consensus seems to be this is not a quick fix type of job.
Ah, well it seems interdisciplinary teams keep us busy wherever we are! I'm still setting up a meeting with the Epigenetics lab (currently in contact with a postdoc). It has been good to go in informed about how much work this could spiral into. Unless we have to, I'll try to stick to well annotated genes. I'll let you know if they bring up more bioinfo questions or if (by some miracle) we develop a working pipeline. I'm keen for it to get used if we end up releasing one.