6.2 years ago by
Czech Republic, Brno, CEITEC
I used to calculate the correlation between log2 microarray probe signal intensities (relative fluorescence units, RFU) and log2 FPKM values. Note that the result will depend on several factors:
- Are you comparing gene-wise or isoform-wise expression? In theory RNA-Seq can capture all isoforms, whereas a microarray is limited by design to the isoforms its probes target. I recommend grouping all isoforms by gene and using the isoform/probeset with the maximal signal as the reference.
- FPKM values, defined as fragments per kilobase of transcript per million mapped reads, are actually calculated in different ways by different tools. For example, the Tophat/Cufflinks pipeline (where Cufflinks does the quantification) can apply various corrections (e.g. GC correction) to FPKM values. You should check whether FPKM or straightforward count/RPKM values give a better correlation with your microarray data. See Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count
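The gene-wise comparison described above could be sketched roughly as follows. This is only an illustration in plain Python: the function names (`max_per_gene`, `log2_correlation`), the pseudocount added to FPKM before taking log2 (to avoid log2(0)), and the toy values are all assumptions of mine, not part of any particular tool.

```python
import math

def max_per_gene(records):
    """Collapse (gene, value) pairs to one value per gene: the maximum signal."""
    best = {}
    for gene, value in records:
        if gene not in best or value > best[gene]:
            best[gene] = value
    return best

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log2_correlation(probesets, isoforms, pseudocount=1.0):
    """Correlate log2 array intensity with log2(FPKM + pseudocount) per gene.

    Only genes present on both platforms are used. Array intensities are
    assumed to be strictly positive; the pseudocount is an assumption made
    here so that FPKM = 0 does not break the logarithm.
    """
    array_by_gene = max_per_gene(probesets)
    fpkm_by_gene = max_per_gene(isoforms)
    genes = sorted(set(array_by_gene) & set(fpkm_by_gene))
    x = [math.log2(array_by_gene[g]) for g in genes]
    y = [math.log2(fpkm_by_gene[g] + pseudocount) for g in genes]
    return pearson(x, y), genes

# Hypothetical toy data: (gene, probeset intensity) and (gene, isoform FPKM).
probesets = [("A", 100.0), ("A", 400.0), ("B", 50.0), ("C", 1600.0), ("D", 10.0)]
isoforms = [("A", 7.0), ("A", 1.0), ("B", 1.0), ("C", 31.0), ("E", 5.0)]

r, shared_genes = log2_correlation(probesets, isoforms)
print(shared_genes, round(r, 3))
```

Swapping the FPKM values for raw counts or RPKM in `isoforms` and rerunning is a simple way to check which quantification correlates best with the array data.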
As for me, I was able to blindly identify tumor RNA-Seq samples from breast and colon cancer by comparing them to a fairly complete panel of reference tissue datasets obtained from GEO (have a look here for GEO accessions). I got correlations in the range 0.4-0.8 for all datasets. Microarray datasets that corresponded to the tumor's tissue of origin gave a significantly higher correlation with the tumor RNA-Seq data (in the range 0.6-0.8) than the other tissue datasets.
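The tissue-matching idea boils down to ranking reference datasets by their correlation with the query sample. A minimal sketch, assuming each profile is already a gene-to-log2-value dictionary; `rank_tissues` and the numbers below are made up for illustration and are not the real GEO data:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_tissues(query, references):
    """Rank reference tissues by correlation with the query profile.

    query: dict gene -> log2 expression (e.g. log2 FPKM of the RNA-Seq sample)
    references: dict tissue -> dict gene -> log2 microarray intensity
    Returns a list of (tissue, r) sorted from highest to lowest r.
    """
    ranking = []
    for tissue, profile in references.items():
        genes = sorted(set(query) & set(profile))  # genes shared by both
        r = pearson([query[g] for g in genes], [profile[g] for g in genes])
        ranking.append((tissue, r))
    return sorted(ranking, key=lambda item: item[1], reverse=True)

# Hypothetical log2 values for four genes.
query = {"G1": 8, "G2": 3, "G3": 6, "G4": 1}
references = {
    "breast": {"G1": 9, "G2": 4, "G3": 6, "G4": 2},
    "colon": {"G1": 5, "G2": 6, "G3": 7, "G4": 2},
    "liver": {"G1": 2, "G2": 8, "G3": 3, "G4": 7},
}

ranking = rank_tissues(query, references)
for tissue, r in ranking:
    print(tissue, round(r, 2))
```

The top-ranked tissue is the candidate tissue of origin; with real data the gap between the best and the remaining tissues matters more than the absolute value of r.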
As for methylation data, there are many ways to show that. You can split your gene set into high-, mid- and low-expressed genes, split your promoter regions into methylated and unmethylated, build a contingency table and perform a statistical test for dependence. You can also compare the promoter methylation level distributions of high- and low-expressed genes with something like the Kolmogorov-Smirnov test, and vice versa, the expression distributions of genes with methylated and unmethylated promoters. As long as your data is biologically consistent, the outcome should not depend much on the statistical test you use, and you'll get a robust result, so just try various approaches. Are you trying to do it for the whole transcriptome or for a single gene/set of genes?
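Both approaches can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the cutoffs, the toy (expression, promoter methylation) pairs and the function names. The chi-square p-value uses the closed form exp(-chi2/2), which holds only for 2 degrees of freedom, i.e. exactly this 3x2 table. In practice `scipy.stats.chi2_contingency` and `scipy.stats.ks_2samp` would likely be the more convenient choice, since they also handle other table sizes and return a KS p-value.

```python
import math
from bisect import bisect_right

def contingency_table(genes, low_cut, high_cut, meth_cut):
    """Rows: low/mid/high expression. Columns: [unmethylated, methylated]."""
    table = [[0, 0], [0, 0], [0, 0]]
    for expr, meth in genes:
        row = 0 if expr < low_cut else (2 if expr >= high_cut else 1)
        table[row][int(meth >= meth_cut)] += 1
    return table

def chi_square_3x2(table):
    """Chi-square test of independence for a 3x2 table (df = 2).

    Assumes every row and column total is non-zero.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.exp(-chi2 / 2)  # chi-square survival function for df = 2
    return chi2, p_value

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        d = max(d, abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)))
    return d

# Hypothetical genes: (log2 expression, promoter methylation fraction).
genes = [
    (0.5, 0.9), (1.0, 0.8), (1.5, 0.85), (0.8, 0.2),
    (4.0, 0.7), (5.0, 0.3), (4.5, 0.6), (5.5, 0.1),
    (9.0, 0.1), (10.0, 0.05), (8.5, 0.15), (9.5, 0.75),
]

table = contingency_table(genes, low_cut=3, high_cut=7, meth_cut=0.5)
chi2, p_value = chi_square_3x2(table)

# Methylation distributions in high- vs low-expressed genes.
meth_high = [m for e, m in genes if e >= 7]
meth_low = [m for e, m in genes if e < 3]
d_stat = ks_statistic(meth_high, meth_low)

print(table, round(chi2, 3), round(p_value, 3), d_stat)
```

With only twelve toy genes nothing here reaches significance; the point is the shape of the analysis, not the numbers.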