4 months ago by
The missing gene_names from StringTie can originate from 3 different sources:
1) It is a novel transcript in a known gene
2) It is a novel transcript in a cluster of genes (multiple gene_names) which are joined together by StringTie/Cufflinks because of their overlap
3) It is a novel gene - meaning no genomic overlap with any feature in the reference you are using.
From my experience with StringTie data there are typically thens of thousands of missing gene_names and ~50% of the missing gene_names are due to problem 1 and 2. To solve this I have just release an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problem 1 and 2 for most genes. You simply use the importRdata() function - which will fix the isoform annotation which is fixable and clean up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.
Hope this helps.