I work with a reference genome that is only partially annotated, and I'm wondering if it's okay for me to discard uncharacterised genes from my dataset (once I've properly calculated TMM-normalisation factors from all transcripts, including the uncharacterised ones).
I can deal with having lots of uncharacterised genes in the output of a classic DGE analysis (i.e. when looking at the top 100ish most DE genes, I can just acknowledge that a subset of these transcripts are unknown and that's fine). However, I also want to build a gene co-expression network (WGCNA), and I'd like to calculate GO enrichment on the relevant gene modules. But obviously, when a large portion of genes are unknown within a module, their GO terms are also unknown and a GO enrichment analysis doesn't really make sense. To overcome that, I want to discard uncharacterised transcripts and only run the analysis on annotated transcripts.
I'm aware that I could also try to annotate these genes myself, but for several reasons I'd rather not (this genome assembly will be obsolete soon and, although that's never a good reason, I'm in a big rush to get a first version of this study out).
Here is a simple outline of the pipeline I'm talking about, starting from a gene raw count matrix:
- Apply TMM normalisation using all transcripts (i.e. true library size)
- Keep only the transcripts that have a known annotation
- Run WGCNA on this subset of transcripts only
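
To make the order of operations concrete, here is a minimal sketch of the first two steps in plain Python. It assumes the TMM factors have already been computed elsewhere from the full count matrix (for example with edgeR) and are simply passed in. The function name `log_cpm_then_subset`, the gene IDs, the counts and the simple pseudocount are all made up for illustration, and the log-CPM formula is a simplified stand-in rather than edgeR's exact calculation.

```python
import math


def log_cpm_then_subset(counts, norm_factors, annotated, pseudocount=1.0):
    """Normalise with full library sizes, then keep annotated genes only.

    counts:       dict of gene_id -> list of raw counts (one per sample)
    norm_factors: per-sample TMM factors, assumed to be computed elsewhere
                  from ALL genes (annotated and uncharacterised)
    annotated:    set of gene_ids that have a known annotation
    """
    n_samples = len(norm_factors)

    # Library sizes are summed over every gene, so the uncharacterised
    # genes still contribute to the denominator.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    eff_lib = [lib_sizes[j] * norm_factors[j] for j in range(n_samples)]

    # Normalise each gene against the effective library size, and only
    # afterwards drop the genes without an annotation.
    return {
        gene: [
            math.log2((row[j] + pseudocount) / eff_lib[j] * 1e6)
            for j in range(n_samples)
        ]
        for gene, row in counts.items()
        if gene in annotated
    }


# Hypothetical toy data: two annotated genes and one uncharacterised locus.
toy_counts = {"geneA": [10, 20], "geneB": [30, 60], "LOC123": [60, 120]}
toy_factors = [1.0, 1.0]  # placeholder TMM factors
toy_annotated = {"geneA", "geneB"}

expr = log_cpm_then_subset(toy_counts, toy_factors, toy_annotated)
```

The resulting `expr` holds only the annotated genes, but their values are still scaled by the library sizes of the complete dataset; this subset is what would then go into WGCNA.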