Question

A Strange Error of DESeq2 Analysis(modified: DESeq2 does not Point Out the Obvious Difference)

1

Entering edit mode

5.7 years ago

afli ▴ 190

Hi, dear friends, I got a strange thing when doing DESeq2 analysis, maybe I just made a silly mistake. Briefly speaking, sample B is much higher than sample A and C, all data are normalized by DESeq2, yet DESeq2 missed this significant group. Could anyone please tell me why? Thank you very much!

My command lines are as follows:

library(DESeq2)
sample_test<-read.table("sample_test.txt",header=T)
counts_test<-read.table("counts_test.txt",header=T,row.names=1)
dds <- DESeqDataSetFromMatrix( countData=counts_test[rowSums(counts_test)>0,], colData=sample_test, design=~ condition)
dds<-DESeq(dds)
library("apeglm")
res<- lfcShrink(dds, coef="condition_B_vs_A", type="apeglm")
write.table(res,"DESeq2_result.txt",sep="\t",quote=F)
norm<-counts(dds,normalized=T)
write.table(norm,"test_normalize.txt",sep="\t",quote=F)

The strange thing happens in Chr6_6392513_6392788_peak. In normalized data, A sample(3 rep), B sample(3 rep) and C sample(3 rep):

Chr6_6392513_6392788_peak   178.648023305415    1.12212517818275    5.79671671533368    1246.9014826137 1124.82594062888      778.761332712122  0   3.66331655600197    0.

Obviously, B is much higher than A and C.

But in DESeq2_result.txt, which compares B and A, I got the following result:

Chr6_6392513_6392788_peak   371.079881967738    0.0246548548440441  0.1784530598261 0.026487239202988   0.347601564343674.

The padj is 0.35, not significant when padj threshold is set as 0.1.

The files are:

DESeq2_result.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/DESeq2_result.txt)
counts_test.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/counts_test.txt)
sample_test.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/sample_test.txt)
test_normalize.txt (https://de.cyverse.org/anon-files/iplant/home/afli123/analyses/test_normalize.txt)

Thank you for your attention!

DESeq2 pvalue • 2.8k views

ADD COMMENT • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

ADD REPLY • link 5.7 years ago by Ram 43k

0

Entering edit mode

OK, thank you very much, Ram, I will do this next time.

ADD REPLY • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

I don't see any errors, just a "strange" result. Instead of res<- lfcShrink(dds, coef="variety_B_vs_A", type="apeglm"), try with something like:

res.shr.cont <- lfcShrink(dds=dds, contrast=c("variety","B","A"), type="apeglm")

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Thank you h.mon, but the contrast command is not fit with apeglm, I got this error:

Error in lfcShrink(dds = dds, contrast = c("variety", "B", "A"), type = "apeglm") : type='apeglm' shrinkage only for use with 'coef'.

My code has no error, but the result is strange, since DESeq2 does not point out the obviouse difference, and I don't know why.

ADD REPLY • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

I do not see anything unusual - DESeq2 actually did very well here, i.e., in not identifying this as statistically significant. Just take a look at the numbers that you've got:

Sample A

178.648023305415    1.12212517818275    5.79671671533368

--> not reproducible across your replicates; extremely high variance

Sample B

1246.9014826137 1124.82594062888      778.761332712122

--> more reproducible than A but still not great; high variance

Sample C

0   3.66331655600197    0

--> again poor reproducibility across your replicates. Also, conducting reliable stats on low counts is difficult

My human brain looking at this tells me that something's not right with your replicates. So, you should look over the wet lab protocols to see what may have happened.

Kevin

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you Kevin.

So in this case, what does the padj value 0.35 mean? Does it mean A sample is the same with B sample? Or, it means DESeq2 just fail to tell the difference? In the first case, it seems a bit strange because the smallest B value is much bigger than the biggest value in A and C. Can I say sample A is the same with sample B? In the second case, is it practical that DESeq2 just excludes the undecidable genes/peaks, so we can say XX gene/peak is similar with XX gene/peak?

ADD REPLY • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

From my perspective, at least, it says that they are not statistically different at a Benjamini-Hochberg adjusted threshold of 5% (0.05). It does not say much about how similar they are (you would have to come up with a different test).

The question you should probably ask is why your replicates have such different values? Your example is the perfect example of why people should simply not be doing 1 versus 1 or 2 versus 2 comparisons. Replicates do not always behave like true replicates.

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Yes, my title may have some misleading effect. You've mentioned that the three replicate of A and C have large variance, and I should look over the wet lab protocols. Actually, I did the experiment for three separate times, every time with one A, one B and one C. So the replicates are ture biological replicates. If you have wet lab experience you can understand that biological replicates have larger variance than technological replicates. Besides, Chr6_6392513_6392788_peak is just a rare case, most regions are consistent between replicates. You've also mention the key point that omic data should have at least three replicates. I saw a paper before which suggested that we'd better do six replicates for one sample.

ADD REPLY • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

Great, yes, 6 replicates would be even better. Unfortunately, some papers also state that using just 1 replicate is okay, which, from my perspective, is reckless.

If Chr6_6392513_6392788 is just a rare example among all of your genomic regions, then perhaps your wet lab work s good! You may want to explore what is specific about that region.

I spent long hours in the wet lab but eventually migrated to bioinformatics. I worked in shellfish microbiology, waste-water testing, and cancer circulating free DNA analysis on blood.

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0

Entering edit mode

We have similar experience, I spend three years doing wet lab experiment but now I am learning some basic bioinformatic skills, because I have to analyse my samples. I am working in plant, studying plant genome function. I think you are very good at bioinformatics because you have high reputation value, haha.

Cheers

Aifu

ADD REPLY • link 5.7 years ago by afli ▴ 190

score 0 · Answer 1 · 2018-08-01

0

Entering edit mode

5.7 years ago

afli ▴ 190

In conclusion, I used to think that large padj value means not significant and two samples are similar, but now I think this is not the case. Just like the sample above, when one sample has very large variance, DESeq2 could not tell the difference(In A sample, rep value could vary from 0 to 178, so if we have another rep, maybe the value could further increase to 1780). However, the large padj value does not mean that the two samples are similar, this is why I feel strange. To say that two samples are similar, I think we should further see the lfc value(DESeq2 should not use the shrink function, or the lfc will be decreased), the lfc value of Chr6_6392513_6392788_peak is nearly 4 without shrink, very large and therefore the two samples are not likely to be similar.

In a word, I trust DESeq2's result, Chr6_6392513_6392788_peak is not significantly different between A and B sample, but we could neither say it is similar between A and B sample. We just need more replicates to confirm whether A and B are different or similar.

If I am wrong, please tell me, thank you!

ADD COMMENT • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

You cannot make the assumption that the samples are similar based on the P value.

Replicates being replicates, they should have comparable values, assuming they are prepared in the same way. There may be some other biological factors in your experiment for which you have not controlled in your modelling. What could that be? You prepared the replicates in exactly the same way?

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0

Entering edit mode

I prepare my samples in three times, to make biological replicates. I try to control the experimental conditions to be same, however, as you said, there's maybe unkown factors affect my experiment each time. So it is reasonable that replicates are not very similar. That's why I use replicates, and use DESeq2 to identify the stable differences between the samples.

ADD REPLY • link 5.7 years ago by afli ▴ 190

1

Entering edit mode

So you repeated the entire experiment three times but did only one biological sample for A/B/C each time. Is that correct?

Exp 1 - Sample A - Biol Rep 1
Exp 2 - Sample A - Biol Rep 2
Exp 3 - Sample A - Biol Rep 3

And so on for B and C?

Is the large difference seen in only one region that you highlight and rest of the data looks reasonably similar (as biological replicates should, accounting for normal biological variation?).

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Hi genomax, I repeat the entire experiment three times, each time with three samples: A/B/C.

Exp1 - Sample_A_rep1 Sample_B_rep1 Sample_C_rep1
Exp2 - Sample_A_rep2 Sample_B_rep2 Sample_C_rep2
Exp3 - Sample_A_rep3 Sample_B_rep3 Sample_C_rep3

The large difference is not only seen in one region. The reason why I say the biological replicates behave similar is that the sample correlation heatmap and pca plot(produced by DESeq2 workflow) looks quite good. There maybe other region that have large variance among replicates but the total number will be small.

ADD REPLY • link 5.7 years ago by afli ▴ 190

0

Entering edit mode

What do A/B/C represent? Is one of them a control and other two treatments/conditions?

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Excuse me genomax, because my work is ongoing, so I could not tell all the details. A/B/C may represent a control with two treatment, or a wild type, a mutant and a overexpression line, and so on.

ADD REPLY • link 5.7 years ago by afli ▴ 190

1

Entering edit mode

We don't need to know the specifics as long as you are doing the logical comparisons. It would have been better to do all three Exp1-3 at the same time (that may have avoided the wide variation you see) but perhaps that was not experimentally feasible.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Yeah, thank you for your suggestion. Though Exp1, Exp2 and Exp3 are same experiment, but it is hard to perform at the same time. Besides, I think that performing experiment in three separate times is more feasible than perform at the same time, thus I'm confident that my experiment could be repeatable.

ADD REPLY • link 5.7 years ago by afli ▴ 190