In this paper:
"A comprehensive evaluation of Ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification"
the authors compare the effect of different annotations on RNA-seq analysis, and they "[...] demonstrated that the choice of a gene model has an effect on the quantification results. [...] When choosing an annotation database, researchers should keep in mind that no database is perfect and some gene annotations might be inaccurate or entirely wrong. Wu et al. suggested that when conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation, such as RefGene, might be preferred. When conducting more exploratory research, a more complex genome annotation, such as Ensembl, should be chosen. Based upon our experience of RNA-Seq data analysis, we recommend using RefGene annotation if RNA-Seq is used as a replacement for a microarray in transcriptome profiling."
This made me think about filtering the annotation. I am interested in transcriptome profiling, so I am not really interested in, for example, a differentially expressed gene with unknown function. For consistency I would like to keep using Ensembl. The mouse annotation GTF file contains 55k genes; I wrote a small script to filter the GTF file, keeping only genes annotated as ensembl_havana, which reduced the number of genes to 22k. I then ran the RNA-seq analysis with featureCounts/DESeq2 using both the full annotation and the filtered annotation. The p-value plots seem better with the filtered annotation (here I uploaded just one example; left: full annotation, right: filtered annotation):
Do you think this approach is valid? I also checked the Biostar Handbook, but I didn't find anything about annotation size and filtering.
Do you know of a package for filtering GTF files? I didn't find one, so I wrote my own script, but I would feel more confident using something that is already well tested.
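To make the question concrete, here is a minimal sketch of this kind of filter, for illustration only. It assumes an Ensembl-style GTF in which the ninth column carries a `gene_source "..."` attribute on every feature line. The function names `filter_gtf_lines` and `filter_gtf_file` are hypothetical, and the source label is passed as a parameter because its exact spelling should be checked against the file being filtered.

```python
import re

# Assumption: Ensembl-style GTF, where column 9 contains gene_source "<label>".
GENE_SOURCE_RE = re.compile(r'gene_source "([^"]+)"')


def filter_gtf_lines(lines, gene_source):
    """Yield header lines plus every feature whose gene_source equals the label."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip blank or malformed lines
        match = GENE_SOURCE_RE.search(fields[8])
        if match and match.group(1) == gene_source:
            yield line  # gene, transcript, exon, ... lines of a matching gene


def filter_gtf_file(in_path, out_path, gene_source):
    """Hypothetical helper: stream one GTF file through the filter into another."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_gtf_lines(src, gene_source))
```

Filtering on the `gene_source` attribute rather than on column 2 keeps all transcript and exon lines of a selected gene together, even when individual features of that gene carry a different source in column 2.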