I need TPM values from my RNA-seq data on mouse samples to feed into different tools that expect expression values per gene. I aligned the reads with STAR (GRCm38 genome, with GENCODE M25 for the junction annotations). When I calculated TPM (using TPMCalculator from NCBI) with the GENCODE annotation, I could not find gene names for a considerable part of the genes/transcripts, which together account for ~25% of the expression values (the total being 1M). If I do the TPM normalization with the curated NCBI RefSeq annotation, everything is of course easily identified with symbols. I know the difference between the two annotations comes from their curation strategies, and I would lean towards the GENCODE annotation since it builds on Ensembl, and both are curated resources. But having all gene IDs identified makes the later analysis easier (I am not interested in specific gene biotypes like pseudogenes).
I have two questions in that regard:
1- Since I need gene symbols for the downstream analysis (quantile ranking, active/non-active), is it fine to drop the GENCODE genes that can't be matched to other annotations, even though they account for 25% of the expression (TPM-wise)?
2- Is it fine to take the BAM files (aligned using GENCODE for the junctions) and feed them to TPMCalculator with the NCBI RefSeq GTF? The alignments were done on GRCm38, and I imagine that as long as the counts are treated per gene (rather than per transcript) it should be okay. With conventional TPM normalization (using a single estimated gene length) the annotation might not matter much, but since this tool calculates an effective gene length that takes into account the different transcripts of the gene and any overlapping features, the choice of annotation might make slight differences in the values.
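To make both questions concrete, here is a minimal sketch of what I mean by conventional TPM and by dropping the unnamed genes. The gene names, counts, and lengths are made up for illustration (not my data), and this uses one fixed length per gene, not the effective length that TPMCalculator computes.

```python
def tpm(counts, lengths_bp):
    """Conventional TPM: reads per kilobase, scaled so all values sum to 1e6."""
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    scale = sum(rpk.values()) / 1e6
    return {g: v / scale for g, v in rpk.items()}


# Hypothetical toy data: two genes with symbols, one without.
counts = {"GeneA": 100, "GeneB": 300, "ENSMUSG_unnamed": 100}
lengths_bp = {"GeneA": 1000, "GeneB": 3000, "ENSMUSG_unnamed": 1000}

values = tpm(counts, lengths_bp)

# Question 1: drop genes without a symbol and see how much TPM goes with them.
named = {g: v for g, v in values.items() if not g.startswith("ENSMUSG")}
lost_fraction = 1 - sum(named.values()) / 1e6

# Optional rescaling of the kept genes back to a total of 1e6.
kept_total = sum(named.values())
renormalized = {g: v * 1e6 / kept_total for g, v in named.items()}

print(f"fraction of TPM dropped: {lost_fraction:.2f}")
print(renormalized)
```

As far as I understand, rescaling the kept genes only multiplies them by a constant, so their relative ranking should not change. For question 2, the `lengths_bp` values are the part that would differ between annotations.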
Any insights are appreciated, and sorry in advance for the naive question.