Can I ignore these MSTRG genes in downstream analysis (pantherdb.org)?
2.5 years ago

Hi,

I am using RNA-seq analysis to find genes differentially expressed between 2 conditions. I am using StringTie for transcript assembly and quantification, and prepDE.py to pass the StringTie output to DESeq2, as instructed on http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq, which outputs gene_count_matrix.csv. This file has gene IDs. Some of them are IDs like NM_000144, which is convenient for downstream analysis. But other rows in my data have an MSTRG tag. Can I ignore these MSTRG genes in downstream analysis (enrichment analysis at pantherdb.org)? If not, how can I get the corresponding gene symbols? Regards,

RNA-Seq deseq2 stringtie • 1.1k views

I did not understand this reply from the link you provided. "If you are interested only standard transcripts/genes (i.e Ensembl, all or targeted), it is okay to exclude MSTRG transcripts/genes for downstream analysis. But do not throw away those genes/transcripts. "
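One way to read the quoted advice is to set the MSTRG rows aside for the enrichment step instead of deleting them. Here is a minimal sketch of that split, assuming the first column of gene_count_matrix.csv holds the gene ID and that IDs with a known name may look like `MSTRG.7|FXN` (that ID format is an assumption, so check your own file). The function name `split_count_matrix` is made up for illustration.

```python
import csv
import io


def split_count_matrix(csv_text, novel_prefix="MSTRG"):
    """Split a gene count matrix into annotated rows and MSTRG-only rows.

    Assumption: column 1 is the gene ID, and an ID that carries a known
    name is written as '<assembled_id>|<name>'.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    annotated, novel = [], []
    for row in reader:
        if not row:
            continue
        gene_id = row[0]
        # Use the part after '|' as the name when it is present.
        name = gene_id.split("|", 1)[1] if "|" in gene_id else gene_id
        if name.startswith(novel_prefix):
            novel.append(row)                    # keep for later, do not discard
        else:
            annotated.append([name] + row[1:])   # usable for enrichment
    return header, annotated, novel


example = (
    "gene_id,s1,s2\n"
    "NM_000144,10,20\n"
    "MSTRG.5,3,4\n"
    "MSTRG.7|FXN,8,9\n"
)
header, annotated, novel = split_count_matrix(example)
```

The `annotated` rows can go to the enrichment tool, while the `novel` rows stay available if you later want to look at the unannotated loci.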


If you work with human or mouse (probably the most well-annotated organisms when it comes to genomics), why do you use StringTie at all? There are comprehensive annotations from GENCODE/Ensembl or RefSeq that you can quantify against. Transcript assembly is probably only beneficial if you are looking for new transcripts, not in a standard analysis. Also keep in mind that transcript assembly probably requires quite some sequencing depth and read length, so why the effort for a standard DE analysis? I would simply quantify with salmon against the GENCODE transcriptome and then proceed with tximport and DESeq2. You would probably need to verify new transcripts from StringTie anyway to show that they are reliable and not artifacts, so save yourself the trouble.


ATpoint, I have always liked your replies, but not this one. I have already done the assembly using StringTie (on AWS). Moreover, I promised my would-be employer that I would use StringTie. I am only getting 167 proper gene IDs out of the 4077 significantly different genes. The rest have MSTRG tags in their IDs.


Well, you don't have to like a reply, of course, but then why do you ask for help? :)


ATpoint is a professional person, so he will rightly think that I am complimenting him in that reply, especially since I asked him another question.

12 months ago

StringTie annotation can have 2 problems: 1) Unassigned gene_name within a single gene: this is a novel transcript in a known gene. 2) A cluster of genes (multiple gene_names/gene_ids) that StringTie has joined together because they overlap in genomic space. Lastly, you can find novel genes, which will also have no corresponding annotation.
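To see which of these cases each MSTRG gene falls into, you can scan the StringTie GTF yourself. This is only a rough sketch: it assumes transcript lines carry a `gene_id` attribute and, where a reference match exists, a `ref_gene_name` or `gene_name` attribute (which of the two you get depends on how the GTF was produced, so check your file). The function name and the three labels are made up for illustration, and this is not how IsoformSwitchAnalyzeR does it.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "MSTRG.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def classify_stringtie_genes(gtf_lines):
    """Label each gene_id by how many reference gene names it contains."""
    names = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are inspected
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        found = names[gene_id]  # registers the gene even if it has no name
        # Assumed attribute names; adjust to match your GTF.
        name = attrs.get("ref_gene_name") or attrs.get("gene_name")
        if name:
            found.add(name)

    result = {}
    for gene_id, found in names.items():
        if not found:
            label = "novel"               # no reference name at all
        elif len(found) == 1:
            label = "single_known_gene"   # problem 1: name can be propagated
        else:
            label = "merged_cluster"      # problem 2: several genes joined
        result[gene_id] = (label, sorted(found))
    return result


prefix = "chr1\tStringTie\ttranscript\t1\t100\t.\t+\t.\t"
example = [
    prefix + 'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; ref_gene_name "A";',
    prefix + 'gene_id "MSTRG.1"; transcript_id "MSTRG.1.2";',
    prefix + 'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; gene_name "B";',
    prefix + 'gene_id "MSTRG.2"; transcript_id "MSTRG.2.2"; gene_name "C";',
    prefix + 'gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";',
]
classes = classify_stringtie_genes(example)
```

Genes labelled `single_known_gene` can simply take the one name found; the `merged_cluster` ones need a decision per transcript, which is the harder part.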

From my experience with StringTie data, there are typically tens of thousands of missing gene_names, and ~50% of the missing gene_names are due to problems 1 and 2. To solve this I have just released an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which fixes the isoform annotation where it is fixable and cleans up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer