microarray, RNAseq, CEL, edgeR, etc. for DGE analysis
4.0 years ago
moxu ▴ 500

I am working on a project that involves both Affy microarray gene expression datasets and RNAseq datasets. My most recent experience is with RNAseq using edgeR. I have some experience with microarray DGE analysis, but that was quite a while ago and I am not sure how the field has progressed since. Searching the internet turned up information posted more than 10 years ago, so I am not sure whether it is still valid today. Please bear with me if the questions below are too naive, or have been answered elsewhere and those answers are still correct.

• How to extract expression level from a .CEL file?
• How to collapse probe expression levels into gene expression levels?
• What's the best way to compare microarray samples with RNAseq samples? I understand it's not advised to do so, but what I have is a set of control samples in microarray, and a set of treatment samples in RNAseq.
• Can I use edgeR for the microarray datasets after some preprocessing (e.g. normalization)? I have developed a whole pipeline for DEG analysis based on edgeR (e.g. volcano plot, MDS plot, heatmaps), and it would be nice if the microarrays can be fed into edgeR.

4.0 years ago

How to extract expression level from a .CEL file?

Reading the fluorescence intensities from the CEL files will depend, initially, on the microarray manufacturer and version. If you let me know which one you are using, I can guide you further. Once you have read the CEL files into an ExpressionSet object, the subsequent steps are fairly standard for the majority of cDNA microarrays.

## ---------------------------------------

How to collapse probe expression levels into gene expression levels?

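library(oligo)  # assumption: the target= argument belongs to oligo's rma() for Gene/Exon ST arrays
# 'project' is the object holding the raw CEL data; target="core" gives gene-level values, target="probeset" exon-level values
project.bgcorrect.norm.avg <- rma(project, background=TRUE, normalize=TRUE, target="core")
project.bgcorrect.norm.avg.Exons <- rma(project, background=TRUE, normalize=TRUE, target="probeset")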


Usually this is done as part of normalisation, using rma() or gcrma(). These perform background correction and quantile normalization, summarise the probes in each probeset, and return expression values on the log2 scale (gcrma() additionally adjusts for probe and target sequence GC bias). More specifically:

• Summarise by gene: rma(..., background=TRUE, normalize=TRUE, target="core")
• Summarise by probeset / exon: rma(..., background=TRUE, normalize=TRUE, target="probeset")
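To make the summarisation step more concrete, here is a toy illustration in base R. RMA summarises each probeset with a robust median-polish fit; the sketch below applies stats::medpolish() to a made-up matrix for a single probeset. The probe and sample values are invented for illustration, and this is not a replacement for rma(), which also handles background correction and normalisation.

```r
# Toy log2 intensities for ONE probeset: 3 probes (rows) x 4 samples (columns).
# All numbers are made up: each entry = sample expression level + probe affinity.
sample_level <- c(s1 = 8, s2 = 9, s3 = 10, s4 = 11)
probe_effect <- c(p1 = -1, p2 = 0, p3 = 1)
probes <- outer(probe_effect, sample_level, "+")

# Median polish splits the matrix into overall + row (probe) + column (sample) effects.
fit <- medpolish(probes, trace.iter = FALSE)

# One summarised value per sample: overall level plus that sample's column effect.
gene_level <- fit$overall + fit$col
print(gene_level)
```

Because the toy matrix is exactly additive, the summarised values come back as the sample levels that went in (8, 9, 10, 11), with the probe-specific offsets removed.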

## -----------------------------------------

What's the best way to compare microarray samples with RNAseq samples? I understand it's not advised to do so, but what I have is a set of control samples in microarray, and a set of treatment samples in RNAseq.

Yes, why do you want to do this? The best that you can do is normalise each dataset as per its respective recommended guidelines, and then get them onto the same data distribution (e.g. both as log2 expression values). After that, you could convert these to the Z scale and then perform a simple / manual merge. If a transcript exists in one but not the other, then there is nothing you can do: it has to be eliminated.
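The merge described above could look roughly like the following base R sketch. The two matrices are simulated stand-ins for already-normalised log2 data, the object names are hypothetical, and scaling each sample across the shared genes is only one possible reading of "convert to the Z scale".

```r
set.seed(1)

# Simulated stand-ins for normalised log2 expression (genes x samples).
array_log2 <- matrix(rnorm(5 * 3, mean = 8, sd = 2), nrow = 5,
                     dimnames = list(paste0("gene", 1:5), paste0("array", 1:3)))
rnaseq_log2 <- matrix(rnorm(4 * 2, mean = 4, sd = 3), nrow = 4,
                      dimnames = list(c("gene2", "gene3", "gene5", "gene6"),
                                      paste0("rnaseq", 1:2)))

# Keep only genes measured on both platforms; the rest are dropped.
shared <- intersect(rownames(array_log2), rownames(rnaseq_log2))

# scale() centres and scales each column, i.e. each sample (assumed choice).
z_array  <- scale(array_log2[shared, , drop = FALSE])
z_rnaseq <- scale(rnaseq_log2[shared, , drop = FALSE])

# Simple merge: same genes in the same order, samples side by side.
merged <- cbind(z_array, z_rnaseq)
print(dim(merged))
```

Subsetting to the shared genes before scaling means every sample is standardised over the same gene set.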

## -------------------------------------------

Can I use edgeR for the microarray datasets after some preprocessing (e.g. normalization)? I have developed a whole pipeline for DEG analysis based on edgeR (e.g. volcano plot, MDS plot, heatmaps), and it would be nice if the microarrays can be fed into edgeR.

You should make your functions as reusable as possible. MA plots, volcano plots, box plots, etc. apply generally to different types of data; therefore, take the opportunity to adapt your functions for general use so that you can re-use them again and again. As a start, here is some code for a simple volcano plot (converting it to an MA plot is easy): A: Volcano Plot from DEseq2
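As an illustration of what a platform-agnostic plotting function might look like, here is a minimal base R sketch. It is not the code from the linked post; the function name, arguments, and default cut-offs are all hypothetical. It only needs a vector of log2 fold changes and a vector of p-values, so it does not care which tool produced them.

```r
# Hypothetical helper: takes log2 fold changes and p-values from any tool.
volcano_plot <- function(log_fc, p_value, fc_cut = 1, p_cut = 0.05, ...) {
  stopifnot(length(log_fc) == length(p_value))

  # Classify each gene against the fold-change and p-value cut-offs.
  status <- ifelse(p_value < p_cut & log_fc >= fc_cut, "up",
            ifelse(p_value < p_cut & log_fc <= -fc_cut, "down", "ns"))
  colours <- c(up = "red", down = "blue", ns = "grey60")

  plot(log_fc, -log10(p_value), col = colours[status], pch = 16,
       xlab = "log2 fold change", ylab = "-log10(p-value)", ...)
  # Dashed guide lines at the cut-offs.
  abline(v = c(-fc_cut, fc_cut), h = -log10(p_cut), lty = 2)

  invisible(status)  # return the classification for re-use
}
```

With edgeR results the inputs would be the logFC and PValue columns of the topTags() table; with limma they would be logFC and P.Value from topTable().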

Here are some other ideas: A: Hierarchical Clustering in single-channel agilent microarray experiment

Kevin


About DGE of microarray data using edgeR: RNA-seq gene counts follow a negative binomial distribution, and microarray probeset values follow another distribution (normal after log transformation?). And maybe logCPM should be replaced by log(count) for the heatmap? I thought everything else would be the same. If so, could I change the distribution parameter, replace logCPM with log(count), and then use the whole pipeline? Reimplementing the whole thing for microarrays seems a daunting task. :)

Thanks a lot for the example of volcano plot, too. I have it implemented for edgeR.


For the actual differential expression analysis, I would just use limma for your microarray data, which is the standard. As you implied, edgeR expects a certain distribution of count values. Microarray and RNA-seq data distributions are inherently different, even after normalisation and log transformation.
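For orientation, a basic limma analysis of log2 microarray data could be sketched as follows. This assumes the Bioconductor package limma is installed; the expression matrix is simulated, and the group labels and object names are made up for illustration.

```r
# Assumes the Bioconductor package limma is installed.
library(limma)

set.seed(42)

# Simulated log2 expression matrix: 100 genes x 6 samples (illustrative only).
expr <- matrix(rnorm(100 * 6, mean = 8, sd = 1), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
group <- factor(rep(c("control", "treated"), each = 3))

# Spike in a shift for the first five genes in the treated samples.
expr[1:5, group == "treated"] <- expr[1:5, group == "treated"] + 3

# Linear model per gene, then empirical Bayes moderation of the variances.
design <- model.matrix(~ group)
fit <- eBayes(lmFit(expr, design))

# Full results table for the treated-vs-control coefficient.
top <- topTable(fit, coef = "grouptreated", number = Inf)
head(top)
```

The logFC, P.Value, and adj.P.Val columns of the resulting table can be passed to generic volcano or heatmap code in the same way as edgeR output.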

Once you have obtained test statistics, you could use the downstream functions of edgeR, but I do not really see the point. If you do, I would be very careful because, for example, the heatmap function of edgeR most likely performs operations that differ from the base heatmap functions.

I have made various postings on heatmaps, which you could follow:

4.0 years ago
h.mon 33k

You can do everything with limma, including reading CEL files, analysing RNAseq with limma-voom, and plotting your results. In fact, several edgeR functions are actually limma functions. Read the limma User Guide carefully, and then ask again if you still have questions.

What's the best way to compare microarray samples with RNAseq samples? I understand it's not advised to do so, but what I have is a set of control samples in microarray, and a set of treatment samples in RNAseq.

If you don't have samples in common between control and treatment, you can't disentangle technical variance from biological variance, so you can't know the biological significance of any gene you find as differentially expressed.