I have a question about filtering Ensembl transcripts by transcript support level (TSL). I am currently collecting Ensembl transcripts for three separate purposes:
- Calculating nucleotide distributions on exons, introns, and UTRs separately (from canonical transcripts to avoid redundancy).
- For a given genomic location, providing an annotation based on location in a gene model.
- RNA quantification (I have recently read this: https://cgatoxford.wordpress.com/2015/10/21/improving-kallisto-quantification-accuracy-by-filtering-the-gene-set/)
I am inclined to filter out TSL 4 (the best supporting EST is flagged as suspect) and TSL 5 (no single transcript supports the model structure) for all purposes to provide more accurate distributions, annotations, quantification, etc.
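To make the proposed filter concrete, here is a minimal sketch of what I have in mind. It assumes the input is an Ensembl GTF in which transcript-level records carry a `transcript_support_level` attribute, possibly with a suffix such as `5 (assigned to previous version 2)` or the value `NA`. The function names (`tsl_of`, `keep_line`, `filter_gtf`) are hypothetical and only for illustration.

```python
import re

# TSL values to discard (TSL 4 and TSL 5, as proposed above).
EXCLUDED_TSL = {"4", "5"}

# Assumed attribute layout: transcript_support_level "<value>";
_TSL_RE = re.compile(r'transcript_support_level "([^"]*)"')


def tsl_of(attributes):
    """Return the TSL as a string ('1'..'5' or 'NA'), or None if absent."""
    match = _TSL_RE.search(attributes)
    if match is None:
        return None
    # Keep only the leading token, dropping any '(assigned to ...)' suffix.
    parts = match.group(1).split()
    return parts[0] if parts else None


def keep_line(line, excluded=EXCLUDED_TSL):
    """True for comment lines and for records whose TSL is not excluded."""
    if line.startswith("#"):
        return True
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return True  # not a standard 9-column GTF record; leave untouched
    return tsl_of(fields[8]) not in excluded


def filter_gtf(lines, excluded=EXCLUDED_TSL):
    """Return the GTF lines that survive the TSL filter."""
    return [line for line in lines if keep_line(line, excluded)]
```

Note that in this sketch records without a TSL (gene lines, or transcripts annotated `NA`) are kept, so a gene whose transcripts are all removed would still leave its gene line behind. Whether `NA` transcripts should be kept is itself part of my question.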
When I filter by these criteria, 50,672 of 191,632 transcripts (~26%) and 5,401 of 57,387 canonical transcripts (~9%) are eliminated from the autosomes. Among the eliminated transcripts, 22,035 transcripts (~29% of all protein-coding transcripts) and 1,941 canonical transcripts (~10% of all protein-coding canonical transcripts) are protein coding. Since these numbers are fairly high and may affect the first purpose in particular, I have become suspicious of this strategy. The link I provided above also shows an interesting result for quantification, which has left me more confused.
So, do you think this type of filtering is appropriate for the given purposes, or is it an overly conservative and/or unnecessary approach?