I am trying to calculate TPM from a raw count CSV supplementary file found on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE141511). I am not interested in DGE; I mainly want to compare expression levels between different genes.
For that, I retrieve gene lengths from the mouse GENCODE comprehensive gene annotation GTF file (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M31/gencode.vM31.chr_patch_hapl_scaff.annotation.gtf.gz).
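For reference, the calculation described above can be sketched as follows. This is a minimal Python sketch, not a definitive implementation: it assumes gene length is taken as the total length of merged (non-overlapping) exons per gene, which is only one common convention, and that the count table uses Ensembl IDs without a version suffix. The function names `gene_lengths_from_gtf` and `tpm`, the helper `_exon`, and all toy IDs and numbers are made up for illustration.

```python
import re
from collections import defaultdict


def gene_lengths_from_gtf(lines):
    """Return {gene_id without version: total bp of merged exons}."""
    exons = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to the length
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue
        gene_id = match.group(1).split(".")[0]  # drop the ".N" version suffix
        exons[gene_id].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:  # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1  # GTF is 1-based, inclusive
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


def tpm(counts, lengths):
    """TPM from raw counts; genes without a known length are skipped."""
    rate = {g: c / lengths[g] for g, c in counts.items() if g in lengths}
    scale = sum(rate.values())
    if scale == 0:
        return {g: 0.0 for g in rate}
    return {g: r / scale * 1e6 for g, r in rate.items()}


def _exon(gene_id, start, end):
    """Build one toy GTF exon line (hypothetical data)."""
    attrs = f'gene_id "{gene_id}"; gene_type "protein_coding";'
    return "\t".join(["chr1", "TOY", "exon", str(start), str(end),
                      ".", "+", ".", attrs])


# Toy annotation: gene A has two overlapping exons plus a separate one.
toy_gtf = [
    "##description: toy annotation",
    _exon("TOY_GENE_A.3", 100, 199),
    _exon("TOY_GENE_A.3", 150, 299),
    _exon("TOY_GENE_A.3", 400, 499),
    _exon("TOY_GENE_B.1", 1000, 1099),
]
toy_lengths = gene_lengths_from_gtf(toy_gtf)

# TOY_GENE_C has no length in the annotation, so it is left out of the TPM.
toy_counts = {"TOY_GENE_A": 300, "TOY_GENE_B": 100, "TOY_GENE_C": 50}
toy_tpm = tpm(toy_counts, toy_lengths)
print(toy_lengths)
print(toy_tpm)
```

Note that in this sketch any gene without a length is silently excluded before scaling, so the remaining TPM values still sum to one million; whether that is the right treatment is exactly what question 2 asks.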
But when I compare the Ensembl IDs from the RNA-seq data with the Ensembl IDs in the GTF file, about 3000 IDs from the RNA-seq data are not found in the GENCODE GTF file. Many of them actually map to other mouse strains and not to the reference C57BL/6J.
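Part of such a mismatch can come from formatting alone, since GENCODE `gene_id` values carry a version suffix (for example `.2`) that count tables often omit. A minimal Python sketch of a comparison that ignores the version is shown here; whether the GEO table actually lacks versions is an assumption, and the function names and toy IDs are hypothetical.

```python
def strip_version(ensembl_id):
    """Drop a trailing '.N' version suffix, if present."""
    return ensembl_id.split(".", 1)[0]


def compare_ids(count_ids, gtf_ids):
    """Split the count-table IDs into those found / not found in the GTF."""
    count_set = {strip_version(i) for i in count_ids}
    gtf_set = {strip_version(i) for i in gtf_ids}
    return {
        "shared": sorted(count_set & gtf_set),
        "missing_from_gtf": sorted(count_set - gtf_set),
    }


# Toy IDs only: the count table has no versions, the GTF does.
result = compare_ids(
    count_ids=["TOY_GENE_A", "TOY_GENE_B", "TOY_GENE_C"],
    gtf_ids=["TOY_GENE_A.3", "TOY_GENE_B.1"],
)
print(result["missing_from_gtf"])
```

IDs that remain in `missing_from_gtf` after this step are the ones that are genuinely absent from the annotation release being used.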
In addition, one ID in particular has huge counts: more than 600,000 counts in each sample, for a total library size of 9 million per sample! It is ENSMUSG00000097971, which maps to a deprecated ID of a miRNA, Gm26917.
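To put those numbers in proportion, here is a small Python sketch of the arithmetic, using the rounded figures quoted above (the function name `library_share` is made up for illustration):

```python
def library_share(gene_count, library_size):
    """Fraction of all counted reads in a sample assigned to one gene."""
    return gene_count / library_size


# Rounded figures from the post: ~600,000 counts out of ~9 million.
share = library_share(600_000, 9_000_000)
print(f"{share:.1%}")  # prints 6.7%
```

So this single ID accounts for roughly one read in fifteen in every sample.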
My questions:
1- Do you think that this huge count for ENSMUSG00000097971 is an artefact caused by problems in the original read mapping? It seems crazy that there are so many reads for one miRNA, and even more so for a deprecated ID.
2- What do you do with the reads for all the IDs that are not found in the annotation used for gene length? Do you drop them and carry out the normalization without them?
3- Do you use a larger database than GENCODE to try to match all the genes from the RNA-seq data, and if so, which one?
Thanks a lot for your help.