Question

Question about differentially expressed genes selection for low sample size data

0

Entering edit mode

6.2 years ago

Vasu ▴ 800

I am using egdeR for differential analysis. It is between gene TULP4 silenced cell-line vs controls. As the sample size is very low 4 shTULP4 and 4 Controls, (shTULP4 vs Controls), I'm selecting differentially expressed genes only based on FDR < 0.05.

In this analysis I'm very much interested in one gene CENPH. I expected this to be downregulated in shTULP4 samples. And I see this is downregulated, but FDR is not < 0.05. But p-value is < 0.05.

GeneSymbols   logFC        logCPM          F          PValue        FDR
   CENPH    -1.581838087  0.068915541  4.826724432  0.039121118 0.250871972

My question is:

Is it fine to say this gene is significantly downregulated in shTULP4 samples based on logFC and PValue? (OR) I have to go with only with logFC and FDR?

And for the volcano plot is it ok to take -log10(PValue) on y-axis instead of -log10(FDR) ?

RNA-Seq R geneexpression edger pvalue • 2.9k views

ADD COMMENT • link updated 6.2 years ago by shawn.w.foley ★ 1.3k • written 6.2 years ago by Vasu ▴ 800

1

Entering edit mode

I think you can talk about a "trend" in this case, but additional validation is needed.

ADD REPLY • link 6.2 years ago by Martombo ★ 3.2k

score 3 · Answer 1 · 2019-05-01

3

Entering edit mode

6.2 years ago

shawn.w.foley ★ 1.3k

Since you're doing a genome-wide experiment you should absolutely stick with an FDR for all of your cutoffs for "significant" genes, to lower the bar and use a p-value instead of an FDR because a particular gene of interest doesn't reach significance can be seen as misleading. You can still report the downregulated trend for CENPH, you just cannot say that it is a significant difference.

In the case of CENPH it might be worth looking into why this is not significant. The logCPM is very low, it looks like the mean expression is around 1 CPM, so it's already at or slightly below the threshold for true expression. It could be worthwhile to make a density plot of the logCPM data to see where your cutoff for "expressed" genes should be.

As far as the actual volcano plot I'd recommend being consistent and graphing -log10(FDR), but technically there's nothing wrong with graphing -log10(PValue) as long as the axis is properly labeled (but it would be a red flag to someone with informatics experience reading the paper). If you're coloring your volcano plot to indicate "significant" versus "non-significant" genes then you should absolutely use the FDR cutoff, even if you're graphing the p-values.

ADD COMMENT • link 6.2 years ago by shawn.w.foley ★ 1.3k

2

Entering edit mode

Ah! Good answer but just focusing on the 'trend' part (also to Martombo). I used that word on Twitter a few weeks ago and was torn to pieces by the 'Twitter trolls' ... you know, those guys who are apparently Professors but seemingly have nothing better to do than look for opportunities to make themselves feel good at the expense of others.

ADD REPLY • link 6.2 years ago by Kevin Blighe 89k

0

Entering edit mode

Hey Kevin so what is your opinion on this finally?

ADD REPLY • link 6.2 years ago by Vasu ▴ 800

1

Entering edit mode

I don't have much to add - the questions you have asked are likely to receive different answers depending on who you ask. Shawn's answer is good, though. I would just be aware that the word 'trend' will likely be flagged (e.g. by a reviewer) if you use it.

ADD REPLY • link 6.2 years ago by Kevin Blighe 89k

0

Entering edit mode

I appreciate the feedback! So do the Twitter trolls/Reviewer #3s take offense to the word "trend" because this is a single experiment and a single gene, or is it offensive because it doesn't reach significance?

I often use it in the context of multiple doses, one dose reaching significance while multiple lower doses move in the same direction. Would "trend" be more appropriate in that case?

ADD REPLY • link 6.2 years ago by shawn.w.foley ★ 1.3k

2

Entering edit mode

Yes, I have used the term before when I was in the USA, as it is what a paediatrician working with me suggested. The thing that the trolls focused on was, well, how can you say that the trend is toward or away from statistical significance. Unfortunately it then just became a shaming exercise, where one guy even posted a Ben Affleck GIF where he was putting his head in his hands.

In this case (above), the nominal p-value passes 0.05 whilst the adjusted does not, so, I can understand the 'trend' is 'toward' in this case (yet the trolls would still tear this apart just to satiate themselves).

I would just be honest in this case and report the nominal and adjusted p-value in the text, and then state that sample n is an issue.

ADD REPLY • link 6.2 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks a lot for the answer shawn. I am aware that FDR is adjusted pvalue but dont know why pvalue is not used for significance instead of adjusted pvalue. If you dont mind could you please tell me exact reason why significance is based on FDR and not pvalue?

ADD REPLY • link 6.2 years ago by Vasu ▴ 800

2

Entering edit mode

Happy to help! I'll preface with the disclaimer that I'm not a statistician, and this is a biologist's understanding.

The FDR is adjusted for a multiple testing correction, so this question really gets at what you mean when you say the word "significant." As a convention we take that word to mean p < 0.05, or there's a 1/20 chance that this result is a false positive. However, if you're now looking at thousands of genes via RNA-seq, a 1/20 false positive rate is way too high. For example, if the knockout of a certain gene gives you a few hundred truly differentially expressed genes, by using a p-value of 0.05 with ~60,000 annotated genes (including ncRNAs) you'll have ~3,000 "significant" genes. Now a vast majority of your "significant" genes are false positives!

Adjusting the p-value is essentially drawing a new cutoff for "significance." Some adjustments, such as Bonferroni, simply say that for n tests, a p-value must be 0.05 / n (in the above example that would be 0.05 / 60,000 = 8.3e-7).

The Bonferroni correction always makes the most intuitive sense to me, but it's much too stringent. That's why the field has moved towards the Benjamini-Hochberg correction, aka the FDR. It's a less conservative way of determining how stringent we should be when we're looking at a large number of tests. The math is beyond me, but effectively it's saying that of all the genes with an FDR < 0.05, only 5% of them should be false positives, therefore when you look at any one "significant" gene there's now a 1/20 chance that it's a false positive. If you look through your file you'll see that FDR = 0.05 corresponds to a specific p-value, so effectively you're determining what your new p-value cutoff should be.