Question

Microarray | RNA Seq | Methylation Arrays - Correlations?


10.0 years ago

andrew.j.skelton73 6.5k

Hi,

I've compared RNA-Seq data to microarray data (downstream, same tissue cohort) by taking the mean expression of each gene in the microarray and plotting it against the corresponding FPKM value (XY scatter). Is this an accepted method? Are there other generally accepted methods that people know of?

Also, if anyone has any suggestions on how to compare methylation data to microarray data (again, same tissue type / cohort), to show a correlation between methylation and gene expression, it'd be very much appreciated!

Thanks

Microarray correlation RNA-Seq methylation • 3.9k views

updated 2.6 years ago by Ram 43k • written 10.0 years ago by andrew.j.skelton73 6.5k


10.0 years ago

Irsan ★ 7.8k

For the first part of your question: when talking about differential gene expression, in the end people are interested in (log) fold changes and their p-values. So those are the ones you should use when comparing RNA-Seq results with array results. You can calculate the Pearson correlation coefficient between RNA-Seq and array logFCs or -log(p-values). In parallel you should do a linear regression and get the slope estimate. When the Pearson correlation coefficient and the slope are both 1 (and the intercept is 0), you have a perfect fit. To quantify the influence of methylation on mRNA expression I would use Pearson correlations and linear regression as well. Also have a look at the SIM Bioconductor package for integration of various omics data sets.
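A minimal sketch of that comparison, in plain Python so it runs without extra packages. The gene names and log fold changes are made up for illustration, and `pearson_r` and `ols_slope_intercept` are hypothetical helper names, not functions from any package. The toy data are chosen so that the correlation is 1 while the slope is 2, which shows why the slope is worth checking alongside the correlation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical log2 fold changes keyed by gene (illustration only).
array_logfc = {"GENE_A": 1.0, "GENE_B": -0.5, "GENE_C": 2.0, "GENE_D": 0.0}
rnaseq_logfc = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 4.0,
                "GENE_D": 0.0, "GENE_E": 1.0}

# Compare only genes measured on both platforms.
shared = sorted(set(array_logfc) & set(rnaseq_logfc))
x = [array_logfc[g] for g in shared]
y = [rnaseq_logfc[g] for g in shared]

r = pearson_r(x, y)
slope, intercept = ols_slope_intercept(x, y)
print(f"genes={len(shared)} r={r:.3f} slope={slope:.3f} intercept={intercept:.3f}")
```

Here the RNA-Seq fold changes are exactly twice the array ones, so the two platforms rank genes identically but disagree on effect size.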



Thanks for your points, all very useful!

10.0 years ago by andrew.j.skelton73 6.5k

Ram · Accepted Answer · 2014-05-07

Hello!

I used to calculate the correlation between log2 microarray probe signal intensity (aka R.F.U.) and log2 FPKM values. Note that the result will depend on several factors:

Are you comparing gene-wise or isoform-wise expression? While RNA-Seq can in theory capture all isoforms, microarrays are limited by design to a subset of isoforms. I recommend grouping all isoforms by gene and selecting the isoform/probeset with the maximal signal as the reference.
FPKM values, defined as fragments per kilobase of transcript per million mapped reads, are actually calculated in different ways by different tools. For example, Cufflinks (typically run after TopHat) can apply various corrections (e.g. GC correction) to FPKM values. You should check whether FPKM or straightforward count/RPKM values give better correlation with your microarray data. See the post "Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count".
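The gene-level collapsing described above could look roughly like this. It is only a sketch: the probeset and transcript records are invented, `collapse_max` is a hypothetical helper, and the pseudocount of 1 added before taking log2 of FPKM is an assumption to avoid log2(0), not something prescribed in this thread.

```python
import math

def collapse_max(records):
    """Keep the maximal value per gene from (gene, feature_id, value) records."""
    best = {}
    for gene, _feature_id, value in records:
        if gene not in best or value > best[gene]:
            best[gene] = value
    return best

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

PSEUDOCOUNT = 1.0  # assumption: keeps log2 defined for FPKM == 0

# Hypothetical microarray probeset intensities (gene, probeset, signal).
probesets = [("GENE_A", "ps1", 120.0), ("GENE_A", "ps2", 800.0),
             ("GENE_B", "ps3", 60.0), ("GENE_C", "ps4", 3000.0)]
# Hypothetical RNA-Seq isoform FPKMs (gene, transcript, FPKM).
isoforms = [("GENE_A", "tx1", 3.0), ("GENE_A", "tx2", 15.0),
            ("GENE_B", "tx3", 0.0), ("GENE_C", "tx4", 63.0)]

array_gene = collapse_max(probesets)
rnaseq_gene = collapse_max(isoforms)

# Restrict to genes present on both platforms, then log2-transform.
genes = sorted(set(array_gene) & set(rnaseq_gene))
log_array = [math.log2(array_gene[g]) for g in genes]
log_fpkm = [math.log2(rnaseq_gene[g] + PSEUDOCOUNT) for g in genes]

r = pearson_r(log_array, log_fpkm)
print(f"{len(genes)} genes, Pearson r on log2 scale = {r:.3f}")
```

Swapping the FPKM column for length-normalized counts or RPKM in `isoforms` and re-running is one way to check which quantification agrees best with the array.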

As for me, I was able to blindly identify tumor RNA-Seq samples from breast and colon cancer by comparing them to a fairly complete panel of reference tissue datasets obtained from GEO (have a look here for GEO accessions). I got correlations in the range 0.4-0.8 for all datasets. Microarray datasets that corresponded to the tumor's tissue of origin gave a significantly higher correlation with the tumor RNA-Seq data (in the range 0.6-0.8) than other tissue datasets.
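The tissue-matching idea can be sketched as ranking reference profiles by their correlation with the sample. Everything below is illustrative: the gene IDs, values, and two-tissue panel are invented, and using Pearson correlation on log2-scale values is an assumption, since the type of correlation is not specified above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical log2(FPKM) values for one RNA-Seq sample of unknown origin.
sample = {"G1": 5.0, "G2": 1.0, "G3": 3.0, "G4": 0.5}

# Hypothetical log2 microarray intensities for reference tissues.
reference_panel = {
    "breast": {"G1": 10.0, "G2": 6.5, "G3": 8.0, "G4": 6.0},
    "colon": {"G1": 6.0, "G2": 9.5, "G3": 7.0, "G4": 8.0},
}

def rank_tissues(sample, panel):
    """Return (tissue, r) pairs sorted from most to least correlated."""
    scores = []
    for tissue, profile in panel.items():
        genes = sorted(set(sample) & set(profile))
        r = pearson_r([sample[g] for g in genes], [profile[g] for g in genes])
        scores.append((tissue, r))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_tissues(sample, reference_panel)
for tissue, r in ranking:
    print(f"{tissue}: r = {r:.2f}")
```

With real data the panel would hold one profile per reference dataset, and the top-ranked tissues are the candidates for the tissue of origin.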

As for methylation data, there are many ways to show that. You can split your gene set into high-, mid- and low-expressed genes, split your promoter regions into methylated and unmethylated, build a contingency table and perform a statistical test for dependence. You can compare the promoter methylation level distributions of high- and low-expressed genes with something like the Kolmogorov-Smirnov test and, vice versa, the expression distributions of genes with methylated and unmethylated promoters. As long as your data is biologically consistent, the result should not depend much on the statistical test you use, and you'll get a robust result; just try various approaches. Are you trying to do it for the whole transcriptome or for a single gene/set of genes?
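Both approaches can be sketched in plain Python. The counts and beta values below are invented for illustration, and the function names are hypothetical. The chi-square p-value shortcut `exp(-stat / 2)` is exact only for 2 degrees of freedom, which is what a 2 x 3 table (methylation status by expression class) gives. The Kolmogorov-Smirnov part computes only the D statistic, not a p-value.

```python
import bisect
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def chi2_pvalue_2dof(stat):
    """Upper-tail chi-square probability; valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical gene counts. Rows: methylated, unmethylated promoter.
# Columns: low-, mid-, high-expressed genes.
table = [[60, 30, 10],
         [20, 40, 60]]
stat, dof = chi_square_independence(table)
p_value = chi2_pvalue_2dof(stat)
print(f"chi-square = {stat:.1f}, dof = {dof}, p = {p_value:.2e}")

# Hypothetical promoter methylation (beta values) per expression group.
beta_high_expr = [0.05, 0.10, 0.15, 0.20]
beta_low_expr = [0.60, 0.70, 0.80, 0.90]
d_stat = ks_statistic(beta_high_expr, beta_low_expr)
print(f"KS D = {d_stat:.2f}")
```

For real analyses a statistics library would normally be used to get p-values for arbitrary table sizes and for the KS test; the hand-rolled versions here only show what is being computed.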