Question

Exclude pseudogenes and lncRNAs from DE analysis?

1

2.6 years ago

Barista ▴ 40

Hi all,

Let me start by thanking those who helped me lately. I almost feel bad about how many questions I have asked in the last few weeks, but the answers were always of great help, so thanks for that!

And yet I have another question. As described in my previous questions, I am running a QuantSeq experiment with 600 patients divided into 2 groups, where we try to gain insight into pathophysiological differences between the groups. The plan is to run a DE analysis (edgeR), GSEA, and GO enrichment, and to illustrate this with a network (Cytoscape).

Lately I have been discussing the importance of pseudogenes and lncRNAs with colleagues. Their opinion is that these should be removed and that we should only look at protein-coding RNAs, partly because including unimportant genes adds to the multiple-testing burden and thus reduces statistical power (the same argument as for filtering out lowly expressed genes).

I agree that excluding unimportant genes increases statistical power, but having read some articles, especially about lncRNAs, I am not so certain that these genes are unimportant. lncRNAs are thought to have regulatory functions in the cell, and some even seem to code for small peptides/proteins. Pseudogenes, on the other hand, are mostly dysfunctional, but even some pseudogenes seem to have regulatory functions.

So my question is whether anyone has experience with this kind of experiment, and what your approach was. Is it reasonable to include only protein_coding genes? Or would this lead to a significant loss of information? All opinions and experiences are greatly appreciated!

Edit: one important note. A pragmatic argument could also be made for excluding non-coding genes, because it seems to me that research on pseudogenes and lncRNAs is still at an early stage. If currently available databases (for example GO) do not include the functions of these genes, then you will never find the function of these genes in the enrichment step. (I hope this makes sense.)

lncRNAs Quantseq • 2.9k views

ADD COMMENT • link updated 24 months ago by Gordon Smyth ★ 7.0k • written 2.6 years ago by Barista ▴ 40

score 3 · Answer 1 · 2021-09-23

3

2.6 years ago

ATpoint 81k

If you are not interested in them, then I would exclude them prior to FDR adjustment, which happens in the results step in DESeq2, the topTags step in edgeR, or the topTable step in limma-voom. There is no gold-standard answer to this. After all, reference annotations differ widely: you get many more annotated genes with GENCODE/Ensembl than with RefSeq, so the FDR and the choice of which genes to include are (to my feeling) fairly arbitrary in genomics anyway. If you have low power due to few replicates, modest effects, or large dispersion, then filtering more might make sense, especially if downstream experiments will focus on protein-coding genes, e.g. Western blot validation.
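To illustrate why the set of genes passed to the FDR step matters, here is a minimal sketch of Benjamini-Hochberg adjustment in plain Python. It is not edgeR or DESeq2 code; the gene names, biotypes, and p-values are made up for illustration, and `benjamini_hochberg` is a hypothetical helper written for this example. The subset is chosen by biotype only, never by looking at the p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene test results: (gene, biotype, raw p-value).
results = [
    ("geneA", "protein_coding", 0.001),
    ("geneB", "protein_coding", 0.012),
    ("geneC", "lncRNA", 0.20),
    ("geneD", "processed_pseudogene", 0.55),
    ("geneE", "protein_coding", 0.03),
    ("geneF", "lncRNA", 0.80),
]

# Adjust over every tested gene.
all_fdr = dict(zip(
    [gene for gene, _, _ in results],
    benjamini_hochberg([p for _, _, p in results]),
))

# Adjust over protein-coding genes only (subset by annotation, not by p-value).
coding = [(gene, p) for gene, biotype, p in results if biotype == "protein_coding"]
coding_fdr = dict(zip(
    [gene for gene, _ in coding],
    benjamini_hochberg([p for _, p in coding]),
))

print(round(all_fdr["geneE"], 3))     # 0.06 when all six genes are tested
print(round(coding_fdr["geneE"], 3))  # 0.03 when only the three coding genes are tested
```

In this toy example geneE passes a 5% FDR cutoff only when the adjustment is restricted to protein-coding genes, which is the power argument in a nutshell. Real data will behave less neatly.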

ADD COMMENT • link 2.6 years ago by ATpoint 81k

0

Hi ATpoint,

Thanks for your reply. I find it a really interesting topic, because I can imagine that once the functions of those genes are well annotated, they should be included in the analysis. But for now I can imagine excluding them.

I think I have enough biological replicates, but as the 2 conditions are quite similar, the effect is quite modest.

As for when to filter out those genes, wouldn't it also be an idea to filter them out before performing the normalization/glmQLFit in edgeR? If you filter them out after normalization, it feels like they contribute to your normalization/model but are discarded later because they are deemed unimportant. What do you think?

ADD REPLY • link 2.6 years ago by Barista ▴ 40

1

I think it is good that they contribute; more genes make everything (normalization and dispersion estimation) more reliable. They were sequenced as part of the library, so they influenced the counts of the protein-coding genes. Hence, I would leave them in and only exclude them from multiple testing.

ADD REPLY • link 2.6 years ago by ATpoint 81k

0

That makes perfect sense, thank you!

ADD REPLY • link 2.6 years ago by Barista ▴ 40

0

In addition to my previous message, I came across this thread: https://support.bioconductor.org/p/126332/. As I interpret it, Gordon Smyth implies there that removing non-coding genes after TMM normalization is not the way to go, and that they should be excluded before normalization. What do you think?

ADD REPLY • link 2.6 years ago by Barista ▴ 40

1

No, I didn't say that.

ATpoint and I have both said that you can go either way, either keep the non-coding genes in throughout or remove them at the beginning. Honestly, the edgeR software is quite robust and the purpose of the software is to allow you to do the analysis that you want to do, not trip you up with traps.

ATpoint pointed out that the different annotation systems are very different, which probably makes more difference than the annotation filtering you ask about. And the QuantSeq platform you are using has already focused on particular types of RNA transcripts before you even start.

Personally, for bulk RNA-seq analyses where the purpose is to discover biological pathways, I often like to keep protein-coding + lncRNAs and remove everything else before normalization. I like to retain a more homogeneous universe of genes through the normalization and dispersion estimation steps, especially if the data is noisy or the number of replicates is small. Frankly, I like to use RefSeq annotation, which has the effect of removing most pseudo-genes even before the read counts are formed. I really hate over-active annotation systems that introduce thousands of almost-never-expressed pseudo-genes that overlap with important almost-certainly-expressed protein-coding genes. Human expression data is quite noisy and subject to batch effects, so I take every opportunity to cut down on noise. But keeping everything in is not wrong, and the edgeR software will work just fine either way. Of course any filtering has to be unbiased, that is, by type of gene rather than by gene function. The only requirement is that you have a large unbiased body of expressed genes through the normalization and dispersion estimation steps.
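A rough sketch of what filtering by gene type before normalization does to the numbers, in plain Python rather than edgeR. The count table, the biotype labels, and the helpers `filter_by_biotype`, `library_sizes`, and `cpm` are all hypothetical, and plain library-size scaling stands in for TMM here, so this only shows the principle that the filter is applied first and the library sizes are recomputed from the retained genes.

```python
# Hypothetical count table: gene -> (biotype, raw counts for two samples).
counts = {
    "geneA": ("protein_coding", [500, 450]),
    "geneB": ("lncRNA", [40, 60]),
    "geneC": ("processed_pseudogene", [10, 0]),
    "geneD": ("protein_coding", [450, 490]),
}

# Keep genes by annotation type only, never by expression pattern or function.
KEEP_BIOTYPES = {"protein_coding", "lncRNA"}


def filter_by_biotype(table, keep):
    """Return a new table containing only genes whose biotype is in `keep`."""
    return {gene: (biotype, c) for gene, (biotype, c) in table.items() if biotype in keep}


def library_sizes(table):
    """Sum the counts per sample over all genes in the table."""
    n_samples = len(next(iter(table.values()))[1])
    return [sum(c[j] for _, c in table.values()) for j in range(n_samples)]


def cpm(table):
    """Counts per million, scaled by the library sizes of this table."""
    sizes = library_sizes(table)
    return {
        gene: [1e6 * x / size for x, size in zip(c, sizes)]
        for gene, (_, c) in table.items()
    }


filtered = filter_by_biotype(counts, KEEP_BIOTYPES)

print(library_sizes(counts))    # [1000, 1000] with every gene included
print(library_sizes(filtered))  # [990, 1000] after dropping the pseudogene
print(sorted(cpm(filtered)))    # ['geneA', 'geneB', 'geneD']
```

In an actual edgeR workflow the analogous move would be to subset the count object by biotype before computing normalization factors and dispersions, so that everything downstream sees the same gene universe.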

ADD REPLY • link 24 months ago by Gordon Smyth ★ 7.0k