Question

Microarray and RNA-seq different result

0

9.6 years ago

bharata1803 ▴ 580

Hello all,

Currently I'm trying to compare differentially expressed genes (DEGs) from microarray and RNA-seq data. I have two categories, A and B, for both the microarray data and the RNA-seq data. My hypothesis is that the DEG lists from microarray and RNA-seq will be pretty much the same, maybe with several differences but not too many. For example, if a set of genes is down-regulated in the microarray data, I would expect the same result from RNA-seq. Currently, fewer than half of the genes agree: fewer than 8000 of the more than 16000 genes I checked. I am not looking at how much the expression changes; I just want to check whether a gene is up-regulated, down-regulated, or at a similar level in both. As for the data, I downloaded it from NCBI GEO. The two datasets come from different (independent) experiments, but both are reported to be from the same cell type. With that difference in experiments, of course there will be differences. My question is: what is your opinion of this case in terms of biological meaning? I want to use this as a basis for comparison with other data sets. If the basis is not reliable, I don't know if my results will be valid.

Edit: information about cut off

I forgot to add information on how I filter the up, down, or similar expression levels. Basically, my cutoff is an arbitrary one. I define it like this:

Up regulation : logFC more than 0.75

Similar regulation : logFC between -0.75 and 0.75

Down regulation : logFC less than -0.75

I used limma for both the microarray and the RNA-seq data to get the logFC.
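
The comparison described above can be sketched in plain Python. This is only an illustration, not the code actually used: the gene names and logFC values are made up, the function names are hypothetical, and it assumes the usual convention that a positive logFC means "up" (flip the signs if your contrast is defined the other way round).

```python
def classify(log_fc, cutoff=0.75):
    """Label one log fold change as 'up', 'down' or 'similar'.

    Assumes positive logFC means up-regulation; the 0.75 cutoff is the
    arbitrary one from the post.
    """
    if log_fc > cutoff:
        return "up"
    if log_fc < -cutoff:
        return "down"
    return "similar"


def agreement(microarray, rnaseq, cutoff=0.75):
    """Count genes that get the same label on both platforms.

    Both arguments are dicts mapping gene name -> logFC. Only genes present
    in both dicts are compared.
    """
    shared = sorted(set(microarray) & set(rnaseq))
    same = [
        gene for gene in shared
        if classify(microarray[gene], cutoff) == classify(rnaseq[gene], cutoff)
    ]
    return len(same), len(shared)


# Made-up logFC values, for illustration only.
microarray_lfc = {"geneA": 1.2, "geneB": -0.1, "geneC": -0.9, "geneD": 0.3}
rnaseq_lfc = {"geneA": 2.5, "geneB": 0.2, "geneC": 0.4, "geneD": -1.8}

n_same, n_shared = agreement(microarray_lfc, rnaseq_lfc)
print(f"{n_same} of {n_shared} shared genes have the same label")
```

Note that a count like this treats every gene equally, whether or not its logFC is statistically supported.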

rna-seq microarray • 4.7k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.6 years ago by bharata1803 ▴ 580

Ram · Answer 1 · 2015-11-30

1

9.6 years ago

cyril-cros ▴ 950

First, plot expression level vs log fold change for each method. Weakly expressed genes with a high log fold change are not reliable: you need decent coverage (at low coverage a large fold change can arise by chance; at high coverage the statistics are meaningful). Depending on the tissue your sample comes from, not all genes are expressed (obviously). Pick a cut-off value for expression level.

The usual way to proceed (e.g. from what I have seen in papers) is then to plot the log fold change of your genes in the RNASeq experiment on one axis and in the microarray on the other. You should see a big blob of roughly invariant genes and a few outliers near the y=x diagonal. You should also calculate the correlation between the two sets of log fold changes.
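
As a rough sketch of the correlation step (plotting left out), here is a Pearson correlation written with the Python standard library only. The two lists are made-up logFC values and are assumed to be in the same gene order; `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up logFC values for the same five genes, in the same order.
rnaseq_lfc = [2.1, -1.8, 0.1, 0.4, -0.3]
microarray_lfc = [0.9, -0.7, 0.0, 0.3, -0.2]

print(f"Pearson r = {pearson(rnaseq_lfc, microarray_lfc):.3f}")
```

A rank-based (Spearman) correlation is a common alternative when a few extreme genes dominate.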

You may well have a quality control issue. Have you done some checks, like FastQC for the RNASeq? How many uniquely aligned reads do you have (from your read alignment software log)? For the microarray, do you have MA plots? For both methods, do you have biological replicates? Do you normalize your data? We would need more details about your workflow for each technique.

RT-qPCR on a few significant genes can also be a good idea to test which method is failing or to confirm your analysis of differentially expressed genes.

For inspiration, look up some of the figures in http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004593 (article about olfactory receptor expression levels).

EDIT: by cut-off, I don't mean the LFC but the mean normalized expression level of A and B.
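
A minimal sketch of such an expression filter, assuming normalized expression values per replicate are already available as plain lists. The threshold of 1.0, the gene names and the numbers are placeholders, not recommendations.

```python
from statistics import mean


def expressed_genes(expr_a, expr_b, min_mean=1.0):
    """Return genes whose mean normalized expression over A and B is >= min_mean.

    expr_a and expr_b map gene name -> list of normalized expression values,
    one value per replicate of condition A or B.
    """
    kept = []
    for gene in expr_a:
        if gene in expr_b and mean(expr_a[gene] + expr_b[gene]) >= min_mean:
            kept.append(gene)
    return kept


# Made-up normalized expression values (three replicates per condition).
expr_a = {"geneA": [5.1, 4.8, 5.3], "geneB": [0.2, 0.0, 0.4], "geneC": [2.0, 2.2, 1.9]}
expr_b = {"geneA": [6.0, 6.3, 5.9], "geneB": [1.1, 0.3, 0.2], "geneC": [0.1, 0.3, 0.2]}

print(expressed_genes(expr_a, expr_b))  # geneB is dropped as weakly expressed
```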

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by cyril-cros ▴ 950

0

Thank you for your suggestions. First, I think I can safely say that quality control is not the problem, because the mapping rate is quite high: several samples are above 85% and several others reach more than 90%. The one thing that concerns me is your suggestion about a cut-off value for expression. I did apply a cut-off on expression level by calculating the average over all samples and over each category. Then I kept only genes with a mean above 1 both in each category and across all categories. I did that only for the RNA-seq data because, according to limma, low expression levels need to be filtered for RNA-seq data but not for microarray.

I will do the plotting and correlation as you suggested to check this first.

Edit: Plot result

The result of the plot is a bit weird. The data blob does not lie on the diagonal but is horizontal. The x axis is RNA-seq and the y axis is microarray. This means that the microarray logFC tends to have a smaller variance than the RNA-seq logFC. The boxplot also shows this.

Thank you!

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by bharata1803 ▴ 580

0

Also, check the normalization step. If your conditions don't differ too much, you expect the average LFC to be at 0. The limma package documentation has case studies that are also helpful.

ADD REPLY • link 9.6 years ago by cyril-cros ▴ 950

0

I just checked the normalization using boxplots. Within the microarray data and within the RNA-seq data, the normalization is good. The problem lies in comparing RNA-seq with microarray: the two have different variances. Do I need to normalize the logFC of RNA-seq and microarray? Is that a logical thing to do?

ADD REPLY • link 9.6 years ago by bharata1803 ▴ 580

0

Check this figure for another example: http://www.biomedcentral.com/1471-2164/12/628/figure/F4. Normalizing the logFC does not make a lot of sense. You normalize to smooth out the differences between replicates and between most genes, which yields an average logFC for each gene. What you should expect is:

both average logFC at 0 (ie the expression of most genes is invariant)
most points are near the diagonal (techniques are in agreement on whether a gene is more/less expressed)

What do you mean by variance? Difference between biological replicates?

PS: http://pastebin.com/QU72XYvX for some R code from a lab session. Read from line 232. The idea is to use lines like limmaHigh <- row.names(limmaRes2[limmaRes2[,"logFC"]>2,]) to select genes with a log fold change above 2, and the equivalent with < -2 for those below -2, for each technique. You then plot your graph and outline in yellow the genes that are DE in both techniques.

It is important that your gene names are in the same order when plotting logFC_RNASeq and logFC_microarray...
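
One way to guarantee the same gene order is to pair the values by gene name rather than by position. A small sketch in plain Python, with hypothetical names and made-up values:

```python
def paired_log_fc(rnaseq, microarray):
    """Pair logFC values by gene name so both lists share one gene order.

    Both arguments map gene name -> logFC. Genes missing from either
    platform are dropped. Returns (genes, rnaseq_values, microarray_values).
    """
    genes = sorted(set(rnaseq) & set(microarray))
    x = [rnaseq[g] for g in genes]
    y = [microarray[g] for g in genes]
    return genes, x, y


# Made-up values; note the different key order and the unshared genes.
rnaseq_lfc = {"geneC": -1.9, "geneA": 2.4, "geneX": 0.3}
microarray_lfc = {"geneA": 1.0, "geneB": 0.1, "geneC": -0.8}

genes, x, y = paired_log_fc(rnaseq_lfc, microarray_lfc)
for gene, rna, array in zip(genes, x, y):
    print(gene, rna, array)
```

The lists `x` and `y` can then go straight into the scatter plot or a correlation.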

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by cyril-cros ▴ 950

0

What I mean by variance is the variance of the logFC. The logFC values from microarray tend to be smaller and have a smaller variance. The logFC values from RNA-seq tend to be bigger and have a bigger variance. I think this is the reason why the scatter plot is almost horizontal with microarray as y and RNA-seq as x. In the figure you gave, the scatter plot is diagonal.
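
To put numbers on this, one could compare the spread of the logFC on each platform and the least-squares slope of microarray on RNA-seq. The sketch below uses made-up values and a hypothetical helper, `slope`. A slope well below 1 corresponds to the near-horizontal cloud. It says nothing by itself about whether the two platforms agree on direction, which is what the correlation measures.

```python
from statistics import mean, stdev


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Made-up logFC values for the same genes, in the same order.
rnaseq_lfc = [-3.0, -1.5, 0.0, 1.5, 3.0]
microarray_lfc = [-0.4, -0.1, 0.05, 0.1, 0.35]

print(f"sd RNA-seq    = {stdev(rnaseq_lfc):.2f}")
print(f"sd microarray = {stdev(microarray_lfc):.2f}")
print(f"slope (microarray on RNA-seq) = {slope(rnaseq_lfc, microarray_lfc):.2f}")
```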

ADD REPLY • link 9.6 years ago by bharata1803 ▴ 580

0

Ok, I will need to leave soon. Do you see the same effect with the most differentially expressed genes? In my experience RNASeq is ~~a bit~~ more sensitive and can detect larger variations.

If you look at GO enrichment and ignore your issues, are the techniques consistent? You can also take a look at the p-values. My guess is the microarrays won't be very informative... I don't know why you see a reduced logFC with them.

My best advice is to just use the RNASeq DE genes and confirm a few with RT-qPCR.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by cyril-cros ▴ 950

0

Thank you. It is probably because the data come from different experiments, not from the same person/author. I am trying to do a meta-analysis that combines data from different sources. Besides this difference, the other results seem to make sense, so I think I will just focus on another point of view.

ADD REPLY • link 9.6 years ago by bharata1803 ▴ 580

Ram · Answer 2 · 2015-11-30

0

9.6 years ago

Jason Chen ▴ 20

I don't think you are using a cutoff for significance so it makes sense that roughly half are different. Most of your "DEG" signal is probably noise (probably, most genes will not be differentially expressed) so there is an equal chance that it will be "up" or "down" in the other datasets. I think you should only consider differentially expressed genes at some log fold change/p value threshold, and then hopefully you will get more agreement.

ADD COMMENT • link 9.6 years ago by Jason Chen ▴ 20

0

Do you mean the cutoff for determining whether a DEG is up, down, or similar? I updated the question with the thresholds that define up, down, and same level. Maybe you can give some suggestions?

ADD REPLY • link 9.6 years ago by bharata1803 ▴ 580

0

No, I believe what he means is that simply looking at the difference in up/down fold change is meaningless for most genes. Rather, you should restrict your analysis only to a small sub-set of genes that are classified as statistically significant (by having a small adjusted p-value) by your DGE software. If you look at only genes where the differential expression is statistically significant (which should be a small subset of all genes), then you will likely (hopefully) see much more agreement.
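
A sketch of that restricted comparison in plain Python. The per-gene results are made up, and the dict keys `logFC` and `adj_p` are placeholders (limma's topTable reports the adjusted p-value as adj.P.Val). The 0.05 threshold is just a common choice, not something stated in this thread.

```python
def significant(results, alpha=0.05):
    """Set of genes whose adjusted p-value is below alpha."""
    return {gene for gene, row in results.items() if row["adj_p"] < alpha}


def direction_agreement(res_a, res_b, alpha=0.05):
    """Among genes significant on both platforms, count same-sign logFC.

    Returns (number with the same direction, number significant in both).
    """
    both = significant(res_a, alpha) & significant(res_b, alpha)
    same = [
        gene for gene in sorted(both)
        if (res_a[gene]["logFC"] > 0) == (res_b[gene]["logFC"] > 0)
    ]
    return len(same), len(both)


# Made-up differential expression results, for illustration only.
microarray_res = {
    "gene1": {"logFC": 1.1, "adj_p": 0.001},
    "gene2": {"logFC": -0.2, "adj_p": 0.6},
    "gene3": {"logFC": -1.4, "adj_p": 0.01},
    "gene4": {"logFC": 0.9, "adj_p": 0.03},
}
rnaseq_res = {
    "gene1": {"logFC": 2.3, "adj_p": 0.0001},
    "gene2": {"logFC": 0.5, "adj_p": 0.4},
    "gene3": {"logFC": -2.0, "adj_p": 0.002},
    "gene4": {"logFC": -1.2, "adj_p": 0.04},
}

n_same, n_both = direction_agreement(microarray_res, rnaseq_res)
print(f"{n_same} of {n_both} genes significant on both platforms agree in direction")
```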

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Rob 7.1k