Question: edgeR gives no DE genes at FDR < 0.05
Jon (United States) wrote 4 weeks ago:

Hi

Why is edgeR giving me no DE genes with FDR < 0.05? What could be wrong with my dataset? I have used the same code on another dataset and got at least 50 DE genes with FDR < 0.05.

Thanks in advance!

qlf2 <- glmQLFTest(fit, coef=2)
cs <- topTags(qlf2, n=Inf, sort.by="PValue")$table

last column of cs:

> FDR
> 0.6110051
> 0.6110051
> 0.6110051
> 0.6182122
> 0.8757088
> 0.8757088
> 0.8757088
> 0.8906914
> 0.8932381
> 0.8932381
modified 4 weeks ago by Antonio R. Franco • written 4 weeks ago by Jon
Incomplete information. Can you provide the complete script? How do the samples look on a PCA plot or distance matrix?

written 4 weeks ago by Satyajeet Khare
The number and nature of the samples would be more important, I guess. It probably comes down to an underpowered experiment.

written 4 weeks ago by ATpoint
library(edgeR)

x <- read.table('/Users/sudharsankannan/desktop/pvrun/countsource/fcounts.txt', header=TRUE, sep="")
########################################## edgeR ###################################
group <- factor(c(2,2,2,1,1,1,3,3,3))
y <- DGEList(counts=x, group=group)
keep <- filterByExpr(y, min.count=5)     # filter lowly expressed genes, usually min.count 5-10
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y); y$samples       # TMM normalization
design <- model.matrix(~group)
y <- estimateDisp(y, design)             # estimate common and tagwise dispersions against the design
## DEG tests
fit <- glmQLFit(y, design)
qlf  <- glmQLFTest(fit, coef=2:3)        # ANOVA-like test over both group coefficients

qlf2 <- glmQLFTest(fit, coef=2)
qlf3 <- glmQLFTest(fit, coef=3)

With none of these three glmQLFTest calls did I get FDR < 0.05, while there are hundreds of genes with p-value < 0.05 (even 60+ genes with p < 0.01).
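For what it's worth, hundreds of raw p < 0.05 with zero FDR < 0.05 is exactly what an all-null dataset produces. A base-R sketch with simulated p-values (standing in for qlf2$table$PValue):

```r
# Simulated all-null p-values; with the real object you would use qlf2$table$PValue
set.seed(7)
p <- runif(15000)                  # 15000 genes, no true signal
sum(p < 0.05)                      # roughly 5% (~750) "hits" by chance alone
fdr <- p.adjust(p, method = "BH")  # Benjamini-Hochberg, what topTags() reports as FDR
sum(fdr < 0.05)                    # usually 0: the raw hits are not enriched near 0
```

If the p-value histogram is flat, this is the regime you are in; a spike near zero that still yields no FDR hits points instead at low power.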

modified 29 days ago • written 29 days ago by Jon
Hi. As everyone suggested, check the PCA on raw counts and on normalized counts and look at the clustering pattern. You might have some unwanted variation present that is affecting the FDR. How many replicates do you have? You may also try different normalization methods, like upper quartile or TMM, and see how the PCA changes. Hope this helps.
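A sketch of swapping normalization methods in edgeR (simulated counts here; with the thread's data you would call this on the existing DGEList y):

```r
library(edgeR)  # assumes edgeR is installed

# Simulated 500-gene x 9-sample count matrix standing in for the real data
set.seed(3)
counts <- matrix(rnbinom(500 * 9, mu = 80, size = 8), nrow = 500)
y <- DGEList(counts = counts)

# Recompute normalization factors under two methods and compare
nf_tmm <- calcNormFactors(y, method = "TMM")$samples$norm.factors
nf_uq  <- calcNormFactors(y, method = "upperquartile")$samples$norm.factors
round(rbind(TMM = nf_tmm, upperquartile = nf_uq), 3)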

written 4 weeks ago by Ankit

In addition to the PCA plot of your samples, could you post the p-value (not FDR) histogram? It will help us tell whether the model you built is okay.
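For reference, a base-R sketch of the two histogram shapes to look for (simulated p-values; with the thread's objects it would be hist(qlf2$table$PValue)):

```r
# Simulated p-value distributions illustrating the diagnostic shapes
set.seed(1)
p_null <- runif(10000)                          # no signal: flat histogram
p_sig  <- c(rbeta(2000, 0.5, 4), runif(8000))   # signal: spike near 0, flat tail
par(mfrow = c(1, 2))
hist(p_null, breaks = 40, main = "No signal (flat)",    xlab = "p-value")
hist(p_sig,  breaks = 40, main = "Signal (spike at 0)", xlab = "p-value")
```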

written 4 weeks ago by kristoffer.vittingseerup
ATpoint (Germany) wrote 4 weeks ago:

Why don't you simply tell us how many replicates you have and what the experimental setup is? I do not see the point in trying methods until you get the desired results, especially if you get no DEGs at all. Rather, you should investigate whether there is an inherent flaw in your experiment, be it little or no replication, insufficient sequencing depth, or other sources of bias such as strong batch effects (which can be detected by e.g. PCA) that lead to large dispersion within your replicates and thereby prevent you from finding DEGs.

I strongly suggest you do that first before considering alternative pipelines and accepting inflated FDRs; doing the latter will give you plenty of false positives. It is suspicious to get no DEGs at all, so odds are good that either 1) there are indeed none or 2) there is an inherent flaw in your experiment.

I would start by plugging the count matrix into plotPCA from DESeq2 after normalizing the raw data with vst. This will give you a first idea of how comparable the replicates are. If you want advice on the interpretation, I suggest you upload a screenshot of that PCA.
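A minimal base-R stand-in for that route (simulated counts; a log2-CPM transform replaces DESeq2's vst() and plain prcomp() replaces plotPCA() in this sketch):

```r
# 2000 genes x 9 samples, three conditions in triplicate (simulated)
set.seed(42)
counts <- matrix(rnbinom(2000 * 9, mu = 100, size = 10), nrow = 2000)
colnames(counts) <- paste0(rep(c("A", "B", "C"), each = 3), 1:3)

# Library-size normalize, then log-transform to tame count heteroscedasticity
cpm_log <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)

pca <- prcomp(t(cpm_log))          # samples as rows
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = colnames(counts), pos = 3, cex = 0.7)
```

Replicates of a condition should co-locate; outliers or intermixed clusters here usually foretell zero DEGs downstream.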

modified 4 weeks ago • written 4 weeks ago by ATpoint
Thank you, here is my PCA.

Reminder: I used edgeR; three conditions, each in triplicate.

[PCA plot]

modified 29 days ago • written 29 days ago by Jon

You should show log2-transformed normalized counts, not non-logged ones; sorry, I should have mentioned that in my comment (the log "compensates" for the large differences between read counts of genes). Still, you can already see in the third PCA that there is no clear clustering by condition, and some samples from different conditions appear similar (at PC1 ≈ 0), so (careful statement) the result you get (0 DEGs) is probably not too unexpected, and rather a reflection of the small number of replicates and the sample dispersion than of the method.

modified 29 days ago • written 29 days ago by ATpoint
Okay, here is the PCA of the log-transformed counts.

[PCA plot of log-transformed counts]

written 29 days ago by Jon

There you see it impressively: while the control group clusters reasonably, both the red and especially the green samples are quite different from each other. That means there is considerable dispersion in the red group and an outlier in the green one, which might explain your results. It is now on you to investigate the reason, but based on this it seems to be an issue with the samples, not the method.

written 29 days ago by ATpoint

Yes, I understand. I used RSEM coupled with STAR to get expected counts from the FASTQ files. Is there any (pre-)processing I can do to clean up the data?

modified 28 days ago • written 28 days ago by Jon

No. You can't process your way around the fact that you have a red sample that looks just like the blue and green ones!

written 28 days ago by swbarnes2

Are there any batch effects you could think of? Were the samples in the left corner processed in one batch and the outliers in a second one, or is there anything else that could make them so different? You should ask yourself whether the outliers are expected to represent the true treatment effect (treatment = expected large effect) or whether the effect is expected to be modest, so that the bottom-left samples rather represent the true effect. Maybe it is simply normal variation (if these are human patients, for example). It is difficult to tell without further details. In any case, if this is the normal variation of your data, you need far more replicates.
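If a batch assignment can be reconstructed, it can go straight into the edgeR design (the batch vector below is hypothetical, purely for illustration):

```r
group <- factor(c(2, 2, 2, 1, 1, 1, 3, 3, 3))  # the thread's grouping
batch <- factor(c(1, 1, 2, 1, 1, 2, 1, 1, 2))  # hypothetical batch labels
design <- model.matrix(~ batch + group)        # batch-adjusted design
colnames(design)  # group coefficients now test condition net of the batch shift
```

This design would then replace model.matrix(~group) in the calls to estimateDisp() and glmQLFit().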

written 28 days ago by ATpoint

I removed one sample at a time from each condition and looked at the PCA plots; there's an outlier in each condition. I will also try svaseq() to remove any batch effect if possible. Thanks for helping me understand everything in this post.

written 25 days ago by Jon

I think this is very important - you need to critically assess your results.

Even if there is a way that essentially forces the data to look like you want, you need some way to show that your result is not horribly over-fit (and not actually reproducible in independent experiments / cohorts).

That said, I like what you have done here. This might already be fair in showing an advantage of your starting expression over the expression with additional normalization. If that holds true for the differentially expressed genes (in a heatmap, not a PCA plot), then those are the main things I can think of checking (besides functional enrichment, but I don't know what you expect there).

written 23 days ago by Charles Warden
swbarnes2 (United States) wrote 4 weeks ago:

What if there's really no significant difference between your samples? The other thing to consider is that one of your samples may be mislabeled or misidentified. Generate a PCA plot (the DESeq2 vignettes and tutorials show how) and see whether the clustering of samples looks sane.

Now that you have PCA plots up, you can see the problem, right? That red circle (whatever red symbolizes) is mixed in with the greens and blues. And except for that one green outlier, the greens and blues sit right next to each other.

Is there a chance that the red circle and the green circle got mixed up?

modified 29 days ago • written 4 weeks ago by swbarnes2
Charles Warden (Duarte, CA) wrote 4 weeks ago:

There are at least two things I would try:

1) Another method (DESeq2, limma-voom, etc.). There are also multiple ways to calculate a p-value within edgeR (such as with dispersions estimated for "edgeR-robust").

You may not be able to lock down the exact right method for a project ahead of time. Sometimes the results are similar, sometimes they are very different.

For example (it is a work in progress, but when I have some free time I have been trying to show this with public data, and I have accordingly added an "Update" to my GitHub acknowledgement): for 2 of the 3 E-MTAB-2682 comparisons, I could identify the gene being altered with DESeq2 or limma-voom but not with edgeR. The point is that, if you test enough projects, you will find one where a given method doesn't work, not that edgeR is worse than DESeq2 or limma-voom. You can see this in the Target_Recovery_Status.xlsx file on the SourceForge page (which should continue to change over time, though probably slowly).

2) You can try raising the FDR cutoff to 0.25 (I have even seen 0.50 used to reduce false negatives). However, it is rare to find no differences at FDR < 0.50 with any method, so it is probably good to treat such results as a hypothesis rather than a conclusion.
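A minimal sketch of the limma-voom alternative from point 1 (simulated counts in the thread's three-conditions-in-triplicate layout; assumes limma and edgeR are installed):

```r
library(limma)
library(edgeR)

# Simulated 1000-gene x 9-sample count matrix with the thread's grouping
set.seed(1)
counts <- matrix(rnbinom(1000 * 9, mu = 50, size = 10), nrow = 1000)
group  <- factor(c(2, 2, 2, 1, 1, 1, 3, 3, 3))
design <- model.matrix(~ group)

v   <- voom(DGEList(counts = counts, group = group), design)  # log-CPM + precision weights
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = 2, number = 5)   # analogous to topTags(qlf2)
```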

modified 4 weeks ago • written 4 weeks ago by Charles Warden
Your comment is really valuable to me; I will try the approaches you mentioned.

written 4 weeks ago by Jon
Antonio R. Franco (Spain, Universidad de Córdoba) wrote 4 weeks ago:

Run a PCA to test whether the samples separate. Sometimes samples get mixed up.

My sequencing company once sent me a treated sample labeled as a control and vice versa. With just that one swapped sample in a triplicate experiment, I could not detect any differential expression.

written 4 weeks ago by Antonio R. Franco