Question

RNA-SEQ where only a subset of genes is of interest

0

Entering edit mode

5.0 years ago

Aspire ▴ 390

I am performing an analysis on RNA-Seq data, where only genes relating to specific pathways are of interest to the researcher.

1) One option is normalizing, estimating the dispersion and performing DE with DESeq2 as usual (since DESeq's assumption that most genes are not differentially expressed pertains to the whole set of genes, not to the subset).

Following that, it would be possible to manually select only the relevant subset of genes, and apply FDR only to this specific subset, ( based on the p-values calculated when taking all genes into account).

This is somewhat analogous imho to what independent-filtering does. Independent-filtering after calculating p-values for all genes, subsets the list for only those with mean higher than a certain cuttoff, maximizing that cutoff. The explicitly stated goal is to increase the number of significantly DE genes, the rationale being that genes with genes with low expression are not interesting in the first place.

Here, what defines which genes are interesting is not the mean level, but the inclusion in a specific set.

Would the described process be suitable, and is there another one if not?

RNA-Seq deseq • 2.7k views

ADD COMMENT • link updated 2 hours ago by piffelpaff • 0 • written 5.0 years ago by Aspire ▴ 390

0

Entering edit mode

Hi! Just wanted to know the update or any suggestions about your queries above.

As I am also doing DEGs using RNA-seq on a subset of protein family.

ADD REPLY • link 12 months ago by Nelo ▴ 20

0

Entering edit mode

The power of common RNA-seq tools comes from borrowing information from all genes to most accurately estimate dispersions along the baseMean gradient. Later subsetting genes to a few in order to get smallest possible FDRs for them after using all genes to achieve this "power" sounds odd to me. I would do a standard analysis (after prefiltering for counts, see edgeR and DESeq2 manuals) and then take the stats as they're returned. It's cherrypicking if you do lots of custom subsetting.

ADD REPLY • link 12 months ago by ATpoint 89k

0

Entering edit mode

Sorry to revive this ancient thread but I'm currently discussing the same question with some people in my lab and would love to somebody pointing out the error in this line of reasoning, because so far I actually don't see it.

As far as I understand, DESeq2's normalization/dispersion correction shrinks variance & logFCs by using information from the whole dataset. This results in more realistic raw p-values and effect size estimates, thereby also reducing type I errors, as explicitly stated in the seminal paper from Love et al. When running thousands of hypothesis tests, FDR is corrected for to keep type I error at 5% (or however i set my FDR threshold).

If I am now interested in a set of target genes (defined a priori due to a certain biological questions, not after looking at the data), and I just happen to have a whole genome dataset, isn't it simply the most robust approach to run DESeq2 to get variance/effect size shrinkage (again, thereby also reducing type I error, which is kind of the opposite of p-hacking?) before I run my statistics? FDR correcting over 1000s of genes would then however also be incorrect because i am only interested in 30 hypothesis. So to not inflate type II error, I extract the raw p-values from DESeq2s output and run benjamini hochberg over those.

The alternative would be to just use the raw data without normalization, run Wilcoxon Rank Sum tests over those and have a higher risk of false positives and negatives given high dispersion rates that stem from my typically small sample size. Subsequent multiple comparison correction on these test would also only use the number of genes in my set, so I don't 'gain' anything in terms of statistical rigor when using this approach. In both variants FDR-correction is based on the number of genes that I'm interested in. I only lose the normalization/dispersion shrinkage that should increase the robustness of my inference.

I would really appreciate some insight on this question :D

Best regards

ADD REPLY • link 2 hours ago by piffelpaff • 0

score 2 · Answer 1 · 2020-09-15

2

Entering edit mode

5.0 years ago

andres.firrincieli 3.9k

manually select only the relevant subset of genes

Just my humble opinion, but that sounds like p-hacking. What is the selection criteria applied to get such relevant subset of genes?

ADD COMMENT • link 5.0 years ago by andres.firrincieli 3.9k

0

Entering edit mode

Biological criteria, not statistical criteria. This is the pathway(s) of most interest to the PI in the first place.

ADD REPLY • link 5.0 years ago by Aspire ▴ 390

0

Entering edit mode

Now I understand. I am sorry but I never heard something like that. By doing so you are basically getting rid of p-value ranking which is used for the FDR calculation. That does not sound right to me

ADD REPLY • link 5.0 years ago by andres.firrincieli 3.9k

score 2 · Answer 2 · 2020-09-15

2

Entering edit mode

5.0 years ago

ATpoint 89k

See the answer in a somewhat similar context from the DESeq2 developer: https://support.bioconductor.org/p/133932/#133938

I advise users to not try too many choices when desiring a certain outcome. See: “garden of forking paths” and issues with replication.

ADD COMMENT • link 5.0 years ago by ATpoint 89k

score 0 · Answer 3 · 2020-09-15

Are you trying this because doing the DESeq pipeline the normal way didn't give you what you wanted?

Unless you have an advanced degree in statistics, you should not presume to know more than the people who wrote your software. Don't get creative with statistical methods. You need to stay on the path in order to be sure that your results mean what you think they mean.