Question: Is it okay to remove uncharacterised transcripts from downstream analysis in RNA-seq?
antoinefelden wrote (5 months ago):

I work with a reference genome that is only partially annotated, and I'm wondering if it's okay for me to discard uncharacterised genes from my dataset (after I've calculated TMM normalisation factors from all transcripts, including the uncharacterised ones).

I can deal with having lots of uncharacterised genes in the output of a classic DGE analysis (i.e. when looking at the top 100ish most DE genes, I can just acknowledge that a subset of these transcripts are unknown and that's fine). However, I also want to build a gene co-expression network (WGCNA), and I'd like to calculate GO enrichment on the relevant gene modules. But obviously, when a large portion of genes are unknown within a module, their GO terms are also unknown and a GO enrichment analysis doesn't really make sense. To overcome that, I want to discard uncharacterised transcripts and only run the analysis on annotated transcripts.

I'm aware that I could also try to annotate these genes myself, but for several reasons I'd rather not (this genome assembly will be obsolete soon, and - although that's never a good reason - I'm in a big rush to get a first version of this study out).

Here is a simple outline of the pipeline I'm talking about, starting from a gene raw count matrix:

  1. Apply TMM normalisation using all transcripts (i.e. true library size)
  2. Retrieve only transcripts for which there is a known annotation
  3. Run WGCNA on this subset of transcripts only
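
The first two steps of the outline above can be sketched roughly as follows. This is only an illustration under stated assumptions: the gene names and counts are made up, and plain library-size scaling (counts per million) stands in for TMM. In practice the `norm_factors` argument would hold the TMM factors (for example from edgeR's `calcNormFactors`); here they default to 1.0. The point is only that library sizes are computed from all genes before the subset is taken.

```python
# Hypothetical toy data: raw counts per gene across three samples.
raw_counts = {
    "geneA": [100, 200, 150],
    "geneB": [50, 80, 60],
    "LOC001": [850, 720, 790],  # uncharacterised transcript
}
annotated = {"geneA", "geneB"}  # genes with a known annotation (assumed)


def library_sizes(counts):
    """Sum counts per sample over ALL genes, annotated or not."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def cpm(counts, lib_sizes, norm_factors=None):
    """Counts per million using effective library size = size * factor.

    norm_factors is a placeholder for TMM factors; 1.0 means no adjustment.
    """
    if norm_factors is None:
        norm_factors = [1.0] * len(lib_sizes)
    effective = [size * f for size, f in zip(lib_sizes, norm_factors)]
    return {
        gene: [c / e * 1e6 for c, e in zip(row, effective)]
        for gene, row in counts.items()
    }


# Step 1: normalise against the full library (all transcripts).
lib = library_sizes(raw_counts)
normalised = cpm(raw_counts, lib)

# Step 2: only afterwards, keep the annotated transcripts.
annotated_only = {g: v for g, v in normalised.items() if g in annotated}
```

Subsetting before computing `lib` would shrink the library sizes and inflate the normalised values of the remaining genes, which is why the order of the two steps matters.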

I think for the enrichment part you can choose just the annotated genes. Even when some known genes have unknown functions, you can still run a GO analysis, and that is generally accepted. In my experience, when I put my genes into DAVID for analysis, it doesn't recognize some IDs and discards them. These IDs could be pseudogenes or lncRNAs, which are not part of the DAVID annotation, and the results are still acceptable.

(comment written 5 months ago by piyushjo)

h.mon wrote (29 days ago):

There are two reasons for not filtering genes when performing co-expression network analysis:

1) when filtering genes, you may change the shape of the network, changing the relation between groups or creating / removing groups.

2) one of the purposes of these analyses is precisely to shed light on the function of unknown genes, by examining how they relate to known genes - by removing unknown genes, you gain no insight into their function.
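
As a rough illustration of the first point, the sketch below computes an unsigned soft-threshold adjacency (|correlation| raised to a power `beta`) and each gene's connectivity, with and without an uncharacterised gene. The expression values, gene names, and `beta = 6` are made up for the example, and this is not the full WGCNA procedure (no topological overlap, no clustering); it only shows that the remaining genes look different to the network once a neighbour is removed.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def connectivity(expr, beta=6):
    """Sum of |cor|^beta between each gene and every other gene."""
    genes = list(expr)
    return {
        g: sum(abs(pearson(expr[g], expr[h])) ** beta for h in genes if h != g)
        for g in genes
    }


# Hypothetical expression profiles across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 1.0, 3.0, 2.0],
    "LOC001": [2.0, 4.0, 6.0, 8.5],  # uncharacterised, tracks geneA closely
}

full = connectivity(expr)
filtered = connectivity({g: v for g, v in expr.items() if g != "LOC001"})

# geneA loses its strongest neighbour when LOC001 is filtered out.
print(full["geneA"], filtered["geneA"])
```

In this toy case geneA goes from being well connected to being nearly isolated, so a hub in the full network could end up unassigned in the filtered one.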

I think you should perform the WGCNA analysis as recommended by the authors, and for the subsequent GO enrichment, discard modules with too few annotated genes.
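
One way to sketch that last step: after running WGCNA on all genes, count the annotated genes in each module and pass only the sufficiently annotated modules to GO enrichment. The module labels, gene names, and both thresholds (`min_annotated`, `min_fraction`) below are hypothetical choices for illustration, not values recommended by WGCNA or by any enrichment tool.

```python
from collections import defaultdict


def modules_for_enrichment(module_of, annotated, min_annotated=2, min_fraction=0.5):
    """Return {module: annotated genes} for modules worth testing.

    module_of maps gene -> module label (e.g. WGCNA module colours).
    A module is kept only if it has at least `min_annotated` annotated
    genes AND at least `min_fraction` of its genes are annotated.
    """
    members = defaultdict(list)
    for gene, module in module_of.items():
        members[module].append(gene)

    kept = {}
    for module, genes in members.items():
        known = [g for g in genes if g in annotated]
        if len(known) >= min_annotated and len(known) / len(genes) >= min_fraction:
            kept[module] = known
    return kept


# Hypothetical module assignment from a network built on ALL genes.
module_of = {
    "geneA": "blue", "geneB": "blue", "LOC001": "blue",
    "geneC": "brown", "LOC002": "brown", "LOC003": "brown",
}
annotated = {"geneA", "geneB", "geneC"}

kept = modules_for_enrichment(module_of, annotated)
print(kept)
```

Here the blue module is kept and the brown module is skipped for enrichment, but the unknown genes in both modules remain in the network, so their co-expression with known genes can still be inspected.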
