Question: Expression thresholds for enrichment analysis of RNA-Seq data
James Cook wrote, 6 weeks ago:

Hi Guys

I have a question about filtering genes detected in an RNA-Seq experiment for subsequent enrichment analysis.

I know it's common to use a p-value threshold in combination with a fold-change cutoff to define your gene set. However, with my data I tend to find that I have a lot of genes which have quite high fold-changes but with low overall expression in my dataset. By contrast, I have quite a few DEGs with modest fold-changes but high overall expression levels.

My question is whether it is common to use absolute expression levels (e.g. group-wise maximum counts) as a criterion for defining gene lists for enrichment/pathway analyses, prioritising genes with high expression levels over those with low expression. Or, if it's not commonly done, is there a good reason not to?

This would look something like the following:

1) define p-value and log2 fold-change cutoffs (e.g. p < 0.01, log2 fold change > 0.5 in either direction)
2) rank the genes which pass according to maximum observed expression in any condition
3) take the top X genes (500, perhaps) and run enrichment on these
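The steps above could be sketched roughly as follows (a minimal illustration, not a real pipeline; `select_genes` and the toy result dicts are made up, and the thresholds are the example values from the post):

```python
# Hypothetical sketch of the proposed filter -> rank -> select pipeline.
# Each gene is a dict with p-value, log2 fold change, and maximum
# observed expression across conditions; all values are illustrative.

def select_genes(results, p_cutoff=0.01, lfc_cutoff=0.5, top_n=500):
    """Step 1: keep genes passing the p-value and |log2FC| cutoffs.
    Step 2: rank survivors by maximum observed expression.
    Step 3: return the top_n most highly expressed."""
    passed = [g for g in results
              if g["pvalue"] < p_cutoff and abs(g["log2fc"]) > lfc_cutoff]
    ranked = sorted(passed, key=lambda g: g["max_expr"], reverse=True)
    return ranked[:top_n]

results = [
    {"gene": "A", "pvalue": 0.001, "log2fc": 1.0, "max_expr": 60000},
    {"gene": "B", "pvalue": 0.005, "log2fc": 2.0, "max_expr": 120},
    {"gene": "C", "pvalue": 0.200, "log2fc": 3.0, "max_expr": 50},
]

top = select_genes(results, top_n=2)
print([g["gene"] for g in top])  # → ['A', 'B'] (C fails the p cutoff; A outranks B on expression)
```

Note that the expression ranking only reorders genes that already passed the significance filter, so it changes which genes reach the enrichment step, not which ones are called significant.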

My reasoning is that I imagine genes with low expression are less likely to be relevant to the function of my cells than moderately/highly expressed genes, and due to their low read counts are also more susceptible to noise, so are more likely to be spurious hits which don't contain useful information about transcriptional/pathway-level changes. For instance, if gene A has 30,000 counts in the control and 60,000 counts in the test condition, whereas gene B has 30 counts in the control and 120 counts in the treated group, I'm more likely to believe that gene A represents a real and relevant change than gene B, even though the fold-change for gene B is double that of gene A.
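To put rough numbers on that intuition: under a simple Poisson counting-noise model (an illustrative simplification that ignores biological dispersion, which DESeq2's negative binomial model handles properly), a delta-method standard error for the log fold change shows how much noisier gene B's estimate is. `log2fc_with_se` is a hypothetical helper, not part of any package:

```python
import math

def log2fc_with_se(a, b):
    """Log2 fold change of b over a, with a rough delta-method standard
    error assuming Poisson-distributed counts (illustrative only)."""
    lfc = math.log2(b / a)
    # Var(ln b - ln a) ~ 1/b + 1/a for Poisson counts; convert to log2 scale.
    se = math.sqrt(1 / a + 1 / b) / math.log(2)
    return lfc, se

# Gene A: 30,000 -> 60,000 counts; gene B: 30 -> 120 counts (from the post).
for name, a, b in [("A", 30_000, 60_000), ("B", 30, 120)]:
    lfc, se = log2fc_with_se(a, b)
    print(f"gene {name}: log2FC = {lfc:.2f} ± {se:.2f}")
# → gene A: log2FC = 1.00 ± 0.01
# → gene B: log2FC = 2.00 ± 0.29
```

Gene B's fold change is twice as large, but its uncertainty is roughly thirty times bigger, which is the intuition behind shrinking low-count fold-change estimates.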

Does this make sense statistically or am I setting myself up for a biased or otherwise flawed analysis by doing this?

Cheers!

Tags: rna-seq

Depending on the tool you are using, it may already (in part) control for this issue (e.g. DESeq2 shrinks fold changes for exactly this reason; see this answer by one of the authors, and I think edgeR does it too). A search for "fold change shrinkage" will lead you to similar questions.

— Papyrus, 6 weeks ago

These days, for DESeq2, it would be the lfcShrink option that explicitly shrinks the logFC estimates, so you could have a look at that. Filtering by p-values is generally difficult: genes with higher expression levels have more power and therefore generally lower p-values, which is not strictly correlated with the logFC, and logFC estimates are noisy when counts are low, which is why the DESeq2/lfcShrink approach is useful. In edgeR there is glmTreat, which explicitly tests against a minimal fold change other than zero. The authors claim that the resulting p-values are more useful for ranking than the default ones from testing against a fold change of zero.

— ATpoint, 6 weeks ago

Yep, my data are already processed with DESeq2, so I'm not too concerned about the inflated fold-change issue, although I still think that, despite the shrinkage, noise will have much more influence on these genes.

My main issue is the idea of selecting by expression level rather than relying primarily on fold-change or p-value. As I mentioned, I still think that lowly expressed genes are less likely to tell me useful/relevant things about what is happening in my cells.

— James Cook, 6 weeks ago

Yes, I understand, although in my opinion this kind of approach is more typical of a pre-testing phase, such as the usual filtering of genes with low or negligible expression (which I have seen done with cutoffs of up to 10-100 counts).

It will always be better if you provide a justification instead of an arbitrary cutoff. For example, when you plot the histogram of the counts (or log counts) you may see two or more "bumps", which may help you make a data-driven decision to retain only the "main" population of expressed genes. Or, if you have biological knowledge of genes which are not relevant in your system, you can also look at their counts, etc. However, within the population that appears to be expressed and detected by the RNA-seq, it will be more difficult to state at which point the data start to be "less trustworthy" or "relevant", because in the end one assumes that the technique is measuring correctly.
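As a rough sketch of that data-driven idea, assuming the log-count histogram really is bimodal ("unexpressed" vs "expressed" bumps), one could place the cutoff at the histogram valley between the two tallest bumps. `valley_cutoff` and the synthetic counts below are illustrative inventions, not a real dataset or an established method; real data would likely need smoothing or a mixture model:

```python
import math
import random

def valley_cutoff(counts, bins=30):
    """Place an expression cutoff at the histogram valley between the
    'unexpressed' and 'expressed' bumps of log10(count + 1).
    Illustrative sketch only; assumes a clearly bimodal distribution."""
    logs = [math.log10(c + 1) for c in counts]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / bins
    hist = [0] * bins
    for x in logs:
        hist[min(int((x - lo) / width), bins - 1)] += 1
    # the two tallest bins that sit well apart stand in for the two bumps
    main = max(range(bins), key=hist.__getitem__)
    other = max((i for i in range(bins) if abs(i - main) > 3),
                key=hist.__getitem__)
    left, right = sorted((main, other))
    # deepest point between the two bumps, converted back to a count
    valley = min(range(left, right + 1), key=hist.__getitem__)
    return 10 ** (lo + (valley + 0.5) * width) - 1

random.seed(0)
# synthetic mix: 2,000 lowly expressed "noise" genes + 8,000 expressed genes
counts = ([int(random.lognormvariate(1, 0.5)) for _ in range(2000)]
          + [int(random.lognormvariate(7, 1.0)) for _ in range(8000)])
print(f"suggested cutoff: ~{valley_cutoff(counts):.0f} counts")
```

On data without an obvious valley this heuristic will pick something arbitrary anyway, which is the point made above: within the apparently expressed population there is no clean line where the data stop being trustworthy.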

Nonetheless, IMHO, even if you filter post-testing, as long as you clearly state what you do, and why, in your results/methods, I do not think it is incorrect; it depends rather on the biological question at hand.

— Papyrus, 6 weeks ago

Thanks for this - I think that makes sense.

My problem with low-expressed genes is not so much that I don't think the changes are "real" as such, just that I question how useful they are as indicators of what type of signalling pathways or functional changes are occurring.

This might be specific to my dataset, but I have found that if I just use p-value and log2 fold-change cutoffs to create the gene sets, the downregulated genes provide much "better" results (i.e. lower p-values, more significant gene sets) than the upregulated ones.

I attribute this to the fact that most of my significantly downregulated genes were expressed at decent levels in the first instance - i.e. the cell is actively driving expression of the gene, presumably for a good "reason", and presumably had a good "reason" for switching it off. On the other hand, genes which are very close to zero in the control condition (we're talking 10-100 counts, so genes with accessible promoters but which are essentially doing nothing at baseline) can reach decent fold-changes with only a modest increase in counts, which could easily be "side effects" of certain programmes being activated - genes that are not the "intended" target but still respond a little.

In other words, genuine transcriptional noise as opposed to technical noise from the sequencing. I know DESeq2 does penalise fold-changes for low expression, but that doesn't eliminate this problem.

This is how I came to think that filtering based on expression as well as fold-change, p-value etc might improve power and provide more biologically useful insights.

In any case I'll try what you've suggested and see if there are any obvious data-driven ways to do this since I also dislike arbitrary cutoffs.

Thanks!

— James Cook, 6 weeks ago