I am using RNAseq analysis to find genes differentially expressed between 2 conditions. I am using StringTie for transcript assembly and quantification.
I am using prepDE.py to use the StringTie output with DESeq2, as instructed at http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq, which outputs gene_count_matrix.csv. This file has gene IDs. Some of them are identifiers like NM_000144, which is convenient for downstream analysis. But other rows in my data have an MSTRG tag. Can I ignore these MSTRG genes in downstream analysis (enrichment analysis at pantherdb.org)? If not, how can I get the corresponding gene symbols?
StringTie annotation can have 2 problems:
1) Unassigned gene_name within a single gene: this is a novel transcript in a known gene.
2) Cluster of genes (multiple gene_names/gene_ids) which are joined together by StringTie because of their overlap in genomic space.
Lastly, you can find novel genes, which will also have no corresponding annotation.
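To see how many of your MSTRG genes fall into each of these cases, a rough sketch along the following lines may help. It is not part of StringTie or IsoformSwitchAnalyzeR. It assumes a merged StringTie GTF in which transcripts matching the reference carry a gene_name attribute; the function names and the example lines are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def gene_names_by_id(gtf_lines):
    """Collect the set of gene_name values seen for each gene_id.

    Only 'transcript' features are read. Assumption: known transcripts
    carry a gene_name attribute, novel ones do not.
    """
    names = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        names[gene_id]  # register the gene even if no name is found
        if "gene_name" in attrs:
            names[gene_id].add(attrs["gene_name"])
    return names


def classify(names):
    """Label each gene_id by how many distinct gene names it covers."""
    labels = {}
    for gene_id, found in names.items():
        if len(found) == 0:
            labels[gene_id] = "novel"         # no known gene overlaps
        elif len(found) == 1:
            labels[gene_id] = "single_gene"   # problem 1 candidates
        else:
            labels[gene_id] = "gene_cluster"  # problem 2
    return labels


# Hypothetical example lines (not real data)
PREFIX = "chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\t"
example = [
    PREFIX + 'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; gene_name "GENE_A";',
    PREFIX + 'gene_id "MSTRG.1"; transcript_id "MSTRG.1.2";',
    PREFIX + 'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; gene_name "GENE_B";',
    PREFIX + 'gene_id "MSTRG.2"; transcript_id "MSTRG.2.2"; gene_name "GENE_C";',
    PREFIX + 'gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";',
]
labels = classify(gene_names_by_id(example))
print(labels)
```

On a real file you would pass the open file handle instead of the example list. Genes labelled single_gene can reasonably take the one name found; gene_cluster and novel genes need more care before enrichment analysis.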
In my experience with StringTie data there are typically tens of thousands of missing gene_names, and ~50% of the missing gene_names are due to problems 1 and 2. To solve this I have just released an update to the R package IsoformSwitchAnalyzeR (available in versions >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which will fix the isoform annotation where it is fixable and clean up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.