Exclude pseudogenes and lncRNAs from DE analysis?
Hi all,

Let me start by thanking everyone who has helped me lately. I almost feel bad about how many questions I have asked in the last few weeks, but the answers were always a great help, so thanks for that!

And yet I have another question. As described in my previous questions, I am running a QuantSeq experiment with 600 patients divided into 2 groups, where we try to gain insight into pathophysiological differences between the groups. The plan is to run a DE analysis (edgeR), GSEA and GO enrichment, and to illustrate the results with a network (Cytoscape).

Lately I have been discussing the importance of pseudogenes and lncRNAs with colleagues. The prevailing opinion is that they should be removed and that we should only look at protein-coding RNAs. One argument is that including unimportant genes adds to the multiple-testing burden of the FDR correction and thus reduces your statistical power (the same argument as for filtering out lowly expressed genes).

I agree that excluding unimportant genes increases statistical power, but having read some articles, especially about lncRNAs, I am not so certain that those genes are unimportant. lncRNAs are thought to have regulatory functions in the cell, and some even seem to code for small peptides/proteins. Pseudogenes, on the other hand, are mostly dysfunctional, but even some pseudogenes seem to have regulatory functions.

So my question is whether someone has experience with this kind of experiment, and what your approach was. Is it reasonable to include only protein_coding genes? Or would this lead to a significant loss of information? All opinions and experiences are greatly appreciated!

Edit: an important note. A pragmatic argument could also be made for excluding non-coding genes, because it seems to me that research on pseudogenes and lncRNAs is still at an early stage. If currently available databases (for example GO) do not include the function of these genes, then you will never find their function through those databases. (I hope this makes sense.)

If you are not interested in them, then I would exclude them prior to FDR adjustment, that is, at the results step in DESeq2 or the topTags/topTable step in edgeR and limma-voom. There is no gold-standard answer to this. After all, reference annotations differ wildly: many more genes are annotated in GENCODE/Ensembl than in RefSeq, so the FDR and the choice of which genes to include are (to my feeling) quite arbitrary in genomics anyway. If you have low power due to low replicate numbers, modest effects, or large dispersion, then filtering more might make sense, especially if downstream experiments will focus on protein-coding genes, e.g. Western blot validation.
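To make the "exclude prior to FDR adjustment" idea concrete, here is a minimal sketch in plain Python. It is not edgeR or DESeq2 code: the gene names, biotypes and raw p-values are made up, and `bh_adjust` is a hand-written Benjamini-Hochberg helper written only for this illustration. It compares adjusting over all tested genes with adjusting over the protein-coding subset only.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical results table: (gene, biotype, raw p-value). All values are invented.
genes = [
    ("geneA", "protein_coding", 0.001),
    ("geneB", "protein_coding", 0.012),
    ("geneC", "protein_coding", 0.030),
    ("geneD", "lncRNA", 0.200),
    ("geneE", "lncRNA", 0.450),
    ("geneF", "pseudogene", 0.600),
    ("geneG", "pseudogene", 0.800),
    ("geneH", "protein_coding", 0.900),
]

# Option 1: adjust over every gene that was tested.
fdr_all = dict(zip((g for g, _, _ in genes),
                   bh_adjust([p for _, _, p in genes])))

# Option 2: keep the raw p-values, drop non-coding genes, then adjust.
coding = [(g, p) for g, biotype, p in genes if biotype == "protein_coding"]
fdr_coding = dict(zip((g for g, _ in coding),
                      bh_adjust([p for _, p in coding])))

for gene, _ in coding:
    print(gene, round(fdr_all[gene], 3), round(fdr_coding[gene], 3))
```

With these invented numbers, geneC has an adjusted p-value of 0.08 when all eight genes are adjusted together and 0.04 when only the four protein-coding genes are, so it would pass a 0.05 FDR cutoff only in the second case. Note that the subset is defined by annotation (biotype), not by looking at the test results, which is what keeps this kind of filtering legitimate.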

Hi ATpoint,

Thanks for your reply. I find it a really interesting topic, because I can imagine that once the functions of those genes are well annotated in the future, they should be included in the analysis. But for now I can see the case for excluding them.

I think I have enough biological replicates, but as the 2 conditions are quite similar, the effect is quite modest.

As for when to filter out those genes: wouldn't it also be an idea to filter them out before performing the normalization/glmQLFit in edgeR? If you filter them out after normalization, it feels like they contribute to your normalization/model but are discarded later because they are deemed unimportant. What do you think?

I think it is good that they contribute: more genes make everything more reliable (normalization and dispersion estimation). They were sequenced as part of the library, so they influenced the counts of the protein-coding genes. Hence, I would leave them in and only exclude them from multiple testing.
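As a toy illustration of why the timing matters, the sketch below computes simple counts-per-million for one sample with the library size taken over all genes and over protein-coding genes only. This is a deliberately simplified stand-in, not TMM or edgeR's actual procedure, and the counts and gene names are made up.

```python
# Made-up counts for a single sample: gene -> (biotype, read count).
counts = {
    "geneA": ("protein_coding", 500),
    "geneB": ("protein_coding", 1500),
    "geneD": ("lncRNA", 6000),
    "geneF": ("pseudogene", 2000),
}


def cpm(count, library_size):
    """Counts per million for one gene, given the library size."""
    return count / library_size * 1_000_000


# Library size over everything that was sequenced and counted.
lib_all = sum(c for _, c in counts.values())
# Library size if non-coding genes are dropped before normalization.
lib_coding = sum(c for biotype, c in counts.values()
                 if biotype == "protein_coding")

gene_a_count = counts["geneA"][1]
cpm_all = cpm(gene_a_count, lib_all)
cpm_coding = cpm(gene_a_count, lib_coding)

print(lib_all, lib_coding)
print(cpm_all, cpm_coding)
```

The same raw count for geneA gives a very different scaled value depending on whether the non-coding reads are part of the library size, so removing genes before or after normalization is not a neutral choice.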

That makes perfect sense, thank you!

In addition to my previous message: I stumbled across this post (https://support.bioconductor.org/p/126332/). There, Gordon Smyth implies that removing non-coding genes after TMM normalization is not the way to go. The way I interpret it, they should be excluded before normalization. What do you think?