Question: edgeR gives no DE genes at FDR < 0.05
Jon (United States) wrote 4 weeks ago:

Hi

Why is edgeR giving me no DE genes with FDR < 0.05? What could be wrong with my dataset? I have used the same code on another dataset and got at least 50 DE genes with FDR < 0.05.

Thanks in advance!

qlf2 <- glmQLFTest(fit, coef=2)
cs <- topTags(qlf2, n=Inf, sort.by="PValue")$table

last column of cs:

> FDR
> 0.6110051
> 0.6110051
> 0.6110051
> 0.6182122
> 0.8757088
> 0.8757088
> 0.8757088
> 0.8906914
> 0.8932381
> 0.8932381
modified 4 weeks ago by Antonio R. Franco • written 4 weeks ago by Jon
Incomplete information. Can you provide the complete script? How do the samples look on a PCA plot or distance matrix?

written 4 weeks ago by Satyajeet Khare
The number and nature of the samples would be more important, I guess. It probably comes down to an underpowered experiment.

written 4 weeks ago by ATpoint
library(edgeR)

x <- read.table('/Users/sudharsankannan/desktop/pvrun/countsource/fcounts.txt', header=TRUE, sep="")
########################################## edgeR ###################################
group <- factor(c(2,2,2,1,1,1,3,3,3))
y <- DGEList(counts=x, group=group)
keep <- filterByExpr(y, min.count=5)     # filter lowly expressed genes, usually min.count 5-10
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y); y$samples       # TMM normalization
design <- model.matrix(~group)
y <- estimateDisp(y, design)             # estimate common and tagwise dispersions against the design
## DEG tests
fit <- glmQLFit(y, design)
qlf  <- glmQLFTest(fit, coef=2:3)        # ANOVA-like test over both group coefficients

qlf2 <- glmQLFTest(fit, coef=2)
qlf3 <- glmQLFTest(fit, coef=3)

With none of these three glmQLFTest calls did I get FDR < 0.05, while there are hundreds of genes with p-value < 0.05 (even 60+ genes with p < 0.01).
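For what it's worth, hundreds of raw p < 0.05 with zero FDR < 0.05 is exactly what an all-null dataset produces. A base-R sketch with simulated p-values (standing in for qlf2$table$PValue):

```r
# Simulated all-null p-values; with the real object you would use qlf2$table$PValue
set.seed(7)
p <- runif(15000)                  # 15000 genes, no true signal
sum(p < 0.05)                      # roughly 5% (~750) "hits" by chance alone
fdr <- p.adjust(p, method = "BH")  # Benjamini-Hochberg, what topTags() reports as FDR
sum(fdr < 0.05)                    # usually 0: the raw hits are not enriched near 0
```

If the p-value histogram is flat, this is the regime you are in; a spike near zero that still yields no FDR hits points instead at low power.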

modified 29 days ago • written 29 days ago by Jon
Hi. As everyone suggested, check the PCA on raw counts and on normalized counts and look at the clustering pattern. You might have some unwanted variation present that is affecting the FDR. How many replicates do you have? You may also try different normalization methods, like upper quartile or TMM, and see how the PCA changes. Hope this helps.
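A sketch of swapping normalization methods in edgeR (simulated counts here; with the thread's data you would call this on the existing DGEList y):

```r
library(edgeR)  # assumes edgeR is installed

# Simulated 500-gene x 9-sample count matrix standing in for the real data
set.seed(3)
counts <- matrix(rnbinom(500 * 9, mu = 80, size = 8), nrow = 500)
y <- DGEList(counts = counts)

# Recompute normalization factors under two methods and compare
nf_tmm <- calcNormFactors(y, method = "TMM")$samples$norm.factors
nf_uq  <- calcNormFactors(y, method = "upperquartile")$samples$norm.factors
round(rbind(TMM = nf_tmm, upperquartile = nf_uq), 3)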

written 4 weeks ago by Ankit

In addition to the PCA plot of your samples, could you post the p-value (not FDR) histogram? It will help us tell whether the model you built is okay.
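For reference, a base-R sketch of the two histogram shapes to look for (simulated p-values; with the thread's objects it would be hist(qlf2$table$PValue)):

```r
# Simulated p-value distributions illustrating the diagnostic shapes
set.seed(1)
p_null <- runif(10000)                          # no signal: flat histogram
p_sig  <- c(rbeta(2000, 0.5, 4), runif(8000))   # signal: spike near 0, flat tail
par(mfrow = c(1, 2))
hist(p_null, breaks = 40, main = "No signal (flat)",    xlab = "p-value")
hist(p_sig,  breaks = 40, main = "Signal (spike at 0)", xlab = "p-value")
```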

written 4 weeks ago by kristoffer.vittingseerup
ATpoint (Germany) wrote 4 weeks ago:

Why don't you simply tell us how many replicates you have and what the experimental setup is? I do not see the point in trying methods until you get the desired results, especially if you get no DEGs at all. Rather, you should investigate whether there is an inherent flaw in your experiment, be it little or no replication, insufficient sequencing depth, or other sources of bias such as strong batch effects (which can be detected by e.g. PCA) that lead to large dispersion within your replicates and thereby prevent you from finding DEGs.

I strongly suggest you do that first before considering alternative pipelines and accepting inflated FDRs; doing the latter will give you plenty of false positives. It is suspicious to get no DEGs at all, so odds are good that either 1) there are indeed none or 2) there is an inherent flaw in your experiment.

I would start by plugging the count matrix into plotPCA from DESeq2 after normalizing the raw data with vst. This will give you a first idea of how comparable the replicates are. If you want advice on the interpretation, I suggest you upload a screenshot of that PCA.
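A minimal base-R stand-in for that route (simulated counts; a log2-CPM transform replaces DESeq2's vst() and plain prcomp() replaces plotPCA() in this sketch):

```r
# 2000 genes x 9 samples, three conditions in triplicate (simulated)
set.seed(42)
counts <- matrix(rnbinom(2000 * 9, mu = 100, size = 10), nrow = 2000)
colnames(counts) <- paste0(rep(c("A", "B", "C"), each = 3), 1:3)

# Library-size normalize, then log-transform to tame count heteroscedasticity
cpm_log <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)

pca <- prcomp(t(cpm_log))          # samples as rows
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = colnames(counts), pos = 3, cex = 0.7)
```

Replicates of a condition should co-locate; outliers or intermixed clusters here usually foretell zero DEGs downstream.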

modified 4 weeks ago • written 4 weeks ago by ATpoint
Thank you, here is my PCA.

Reminder: I used edgeR; three conditions, each in triplicate.

[PCA plot]

modified 29 days ago • written 29 days ago by Jon

You should show log2-transformed normalized counts, not non-logged ones; sorry, I should have mentioned that in my comment (the log "compensates" for the large differences between read counts of genes). Still, you can already see in the third PCA that there is no clear clustering by condition, and some samples from different conditions appear similar (at PC1 ≈ 0), so (careful statement) the result you get (0 DEGs) is probably not too unexpected, and rather a reflection of the small number of replicates and the sample dispersion than of the method.

modified 29 days ago • written 29 days ago by ATpoint
Okay, here is the PCA of the log-transformed counts.

[PCA plot of log-transformed counts]

written 29 days ago by Jon

There you see it impressively: while the control group clusters reasonably, both the red and especially the green samples are quite different from each other. That means there is considerable dispersion in the red group and an outlier in the green one, which might explain your results. It is now on you to investigate the reason, but based on this it seems to be an issue with the samples, not the method.

written 29 days ago by ATpoint

Yes, I understand. I used RSEM coupled with STAR to get expected counts from the FASTQ files. Is there any (pre-)processing I can do to clean up the data?

modified 28 days ago • written 28 days ago by Jon

No. You can't process your way around the fact that you have a red sample that looks just like the blue and green ones!

written 28 days ago by swbarnes2

Are there any batch effects you could think of? Were the samples in the left corner processed in one batch and the outliers in a second one, or is there anything else that could make them so different? You should ask yourself whether the outliers are expected to represent the true treatment effect (treatment = expected large effect) or whether the effect is expected to be modest, so that the bottom-left samples rather represent the true effect. Maybe it is simply normal variation (if these are human patients, for example). It is difficult to tell without further details. In any case, if this is the normal variation of your data, you need far more replicates.
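If a batch assignment can be reconstructed, it can go straight into the edgeR design (the batch vector below is hypothetical, purely for illustration):

```r
group <- factor(c(2, 2, 2, 1, 1, 1, 3, 3, 3))  # the thread's grouping
batch <- factor(c(1, 1, 2, 1, 1, 2, 1, 1, 2))  # hypothetical batch labels
design <- model.matrix(~ batch + group)        # batch-adjusted design
colnames(design)  # group coefficients now test condition net of the batch shift
```

This design would then replace model.matrix(~group) in the calls to estimateDisp() and glmQLFit().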

written 28 days ago by ATpoint

I removed one sample at a time from each condition and looked at the PCA plots; there's an outlier in each condition. I will also try svaseq() to remove any batch effect if possible. Thanks for helping me understand everything in this post.

written 25 days ago by Jon

I think this is very important - you need to critically assess your results.

Even if there is a way that essentially forces the data to look like you want, you need some way to show that your result is not horribly over-fit (and not actually reproducible in independent experiments / cohorts).

That said, I like what you have done here. This might already be fair in showing an advantage of your starting expression over the expression with additional normalization. If that holds true for the differentially expressed genes (in a heatmap, not a PCA plot), then those are the main things I can think of checking (besides functional enrichment, but I don't know what you expect there).

written 23 days ago by Charles Warden
swbarnes2 (United States) wrote 4 weeks ago:

What if there's really no significant difference between your samples? The other thing to consider is that one of your samples may be mislabeled or misidentified. Generate a PCA plot (the DESeq2 vignettes and tutorials show how) and see whether the clustering of samples looks sane.

Now that you have PCA plots up, you can see the problem, right? That red circle (whatever red symbolizes) is mixed in with the greens and blues. And except for that one green outlier, the greens and blues sit right next to each other.

Is there a chance that the red circle and the green circle got mixed up?

modified 29 days ago • written 4 weeks ago by swbarnes2
Charles Warden (Duarte, CA) wrote 4 weeks ago:

There are at least two things I would try:

1) Another method (DESeq2, limma-voom, etc.). There are also multiple ways to calculate a p-value within edgeR (such as with dispersions estimated for "edgeR-robust").

You may not be able to lock down the exact right method for a project ahead of time. Sometimes the results are similar, sometimes they are very different.

For example (it is a work in progress, but when I have some free time I have been trying to show this with public data, and I have accordingly added an "Update" to my GitHub acknowledgement): for 2 of the 3 E-MTAB-2682 comparisons, I could identify the gene being altered with DESeq2 or limma-voom but not with edgeR. The point is that, if you test enough projects, you will find one where a given method doesn't work, not that edgeR is worse than DESeq2 or limma-voom. You can see this in the Target_Recovery_Status.xlsx file on the SourceForge page (which should continue to change over time, though probably slowly).

2) You can try raising the FDR cutoff to 0.25 (I have even seen 0.50 used to reduce false negatives). However, it is rare to find no differences at FDR < 0.50 with any method, so it is probably good to treat such results as a hypothesis rather than a conclusion.
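A minimal sketch of the limma-voom alternative from point 1 (simulated counts in the thread's three-conditions-in-triplicate layout; assumes limma and edgeR are installed):

```r
library(limma)
library(edgeR)

# Simulated 1000-gene x 9-sample count matrix with the thread's grouping
set.seed(1)
counts <- matrix(rnbinom(1000 * 9, mu = 50, size = 10), nrow = 1000)
group  <- factor(c(2, 2, 2, 1, 1, 1, 3, 3, 3))
design <- model.matrix(~ group)

v   <- voom(DGEList(counts = counts, group = group), design)  # log-CPM + precision weights
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = 2, number = 5)   # analogous to topTags(qlf2)
```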

modified 4 weeks ago • written 4 weeks ago by Charles Warden
Your comment is really valuable to me; I will try the approaches you mentioned.

written 4 weeks ago by Jon
Antonio R. Franco (Spain, Universidad de Córdoba) wrote 4 weeks ago:

Run a PCA to test whether the samples separate. Sometimes samples get mixed up.

My sequencing company once sent me a treated sample labeled as a control and vice versa. With just that one swapped sample in a triplicate experiment, I could not detect any differential expression.

written 4 weeks ago by Antonio R. Franco