Question: Strange MA-plot using DESeq2
1
gravatar for stan
2.8 years ago by
stan70
Pretoria, South Africa
stan70 wrote:

Hello,

I am using DESeq2 to perform some differential expression analysis between 3 levels of a disease condition. However, I am getting a strange looking MA-plot when i compare some of the conditions. I am trying to understand what could be the reason for the skewness in my MA-plot.

Any help will be greatly appreciated.

I have provided the PCA plot of the different samples as well as one of the MA-plots (comparing LTNP vs VC). The analysis steps I have taken are below:

sampleFiles <- grep("count",list.files(directory),value=TRUE)
sampleCondition<-c(rep("EC", 3), rep("LTNP", 4), rep("VC",4))
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)

dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ]
rld <- rlog(dds, blind = FALSE)
dds <- DESeq(ddsHTSeq2)
res <- results(dds, contrast=c("condition", "LTNP", "VC"))

Thanks, Stan

MA-plot PCAplot

rna-seq deseq2 R • 2.0k views
ADD COMMENTlink modified 2.7 years ago by Biostar ♦♦ 20 • written 2.8 years ago by stan70
1

Can you add to your post to explain your preprocessing steps, and a bit more about your samples (species for example). Something's not right if 99% of your variance is explaining the difference between two conditions.

ADD REPLYlink written 2.8 years ago by andrew.j.skelton735.8k

I have added the analysis steps. These are samples from HIV-positive individuals grouped according to viral load

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by stan70

My only guess is that something is going very wrong when it's computing the size factors. Can you post the output of sizeFactors(dds)?

Edit: I should note that this is a long-shot explanation.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Devon Ryan92k
>  sizeFactors(dds)
      EC_1.count  EC_2.count  EC_3.count  LTNP_1.count 
4.69209178  4.68776603  4.46989074  5.10407662 

    LTNP_2.count    LTNP_3.count    LTNP_4.count    VC_1.count
4.88086201      3.97763836      6.47169000      0.07196773

    VC_2.count  VC_3.count  VC_4.count 
0.06824966  0.05651823  0.05470625
ADD REPLYlink modified 2.8 years ago by WouterDeCoster41k • written 2.8 years ago by stan70
4

Holy crap my long-shot explanation has legs! In my experience, whenever you get more than ~10x difference in scale factors you start getting really weird results. Since you have closer to 100x differences, it's quite likely that this is the cause of the problem.

Were the VC samples really sequenced to such a different depth? Can you post the scatter plot of the raw counts of one of the VC samples vs. one of the non-VC samples?

ADD REPLYlink written 2.8 years ago by Devon Ryan92k

Here is the scatter plot of the raw counts between one VC and LTNP samples.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by stan70

OK, so there seems to be a HUGE asymmetric difference between VC and LTNP and that's screwing up the size normalization. Do you have ERCC spike-ins? This is one of those rare cases where it'd be better to use them. I should note that the end effect of getting this right will be a very asymmetric change where many many more genes are down-regulated than up-regulated in LTNP vs. VC.

ADD REPLYlink written 2.8 years ago by Devon Ryan92k
1

Good point, but also just to be clear, this experiment was performed on the same sequencing run right? They're not random publicly available samples you've pulled down to compare? Also, if you're able, a multiQC report might help to look at.

ADD REPLYlink written 2.8 years ago by andrew.j.skelton735.8k

Well the VC's are very different compared to the rest... that's a clue to solve this mystery :p

ADD REPLYlink written 2.8 years ago by WouterDeCoster41k

99% variance between the groups explained by 1 PC!?!? There's no way that can be real. Did you set the size factors in an atypical manner?

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Devon Ryan92k

I just followed the steps as outlined in the DESeq2 manual. Could there be a problem with my group "VC" samples? which is causing the 99% variance?

ADD REPLYlink written 2.8 years ago by stan70

You have 3 samples: how did you set the design variable? I struggled with it many times, and I know it can change a lot both in the plots and in the actual p-values.

EDIT: also, are you using normalized counts? DESeq2 wants raw counts.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Macspider2.9k

I used raw counts generated using HTSeq. I think my design is OK, it's as follows:

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)

I only have one variable "condition", which has three levels (EC, VC, LTNP).

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by stan70

Can you add in what aligner you used, the code for that, and your code for HTSeqCount please.

ADD REPLYlink written 2.8 years ago by andrew.j.skelton735.8k

don't forget that DESeq2 default (as of version 1.14.1) is to plot only the top 500 genes (highest row variance) in the PCA

ADD REPLYlink written 2.7 years ago by Martombo2.5k

Is this written somewhere? Can you link this information here for me as well? I never read that, maybe I missed it! Thanks.

ADD REPLYlink written 2.7 years ago by Macspider2.9k

From R, look at ?plotPCA after importing DESeq2. It uses the top 500 by default, but this can be adapted using parameter.

ADD REPLYlink written 2.7 years ago by WouterDeCoster41k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2015 users visited in the last hour