Question

Resolving expression profiles of genes with same name but different MSTRG ID

0

Entering edit mode

3.8 years ago

SJ Basu ▴ 50

Hello People,

The following is a problem that I am trying really hard for long time to resolve but can't come up to reasonable solution.

I execute the new tuxedo pipeline to analyze my RNA-seq samples. [Hisat2 -> StringTie2 -> TxImport -> Deseq2]

Now I extract the TPM counts for each transcript from TxImport's object txi$abundance then total them for each gene. I find that a few genes are in duplicate

Gene_ID Treated1    Treated2    Treated3    Contrl01    Contrl02    Contrl03  
ENSG00000233639:PANTR1  0.177826    0.023043    0.270347    0.027134    0.116557    0.165914 
ENSG00000233639:PANTR1  0   0   0   0   0   0 
ENSG00000235652:AL356599.1  1.218245    0.161862    1.835312    0.662362    0.355804    0.938479 
ENSG00000235652:AL356599.1  0   0   0   0   0   0  
ENSG00000242759:LINC00882   0.249292    0.367449    0.152916    0.162572    0.096571    0.120887 
ENSG00000242759:LINC00882   0   0   0   0   0   0  
ENSG00000253853:AC246817.1  1.071109    1.144176    0.726447    1.027518    0.747923    0.72971 
ENSG00000253853:AC246817.1  0.197055    0.056572    0.085629    0.214114    0.139897    0.172577 
ENSG00000257761:AC078860.2  0.888787    0.967599    0.529618    4.099847    3.220867    2.677969 
ENSG00000257761:AC078860.2  0.231985    0.190595    0.022277    0.052071    0.021828    0.070146

Well this might not be a big issue, because I can manually remove them as they are just counts.

Now I perform Deseq's analysis on MSTRG IDs...but on extracting results, I convert the MSTRG IDs to gene names and the problem starts

GeneID      Gene_names              baseMean          log2FoldChange       P-Value
MSTRG.10967 ENSG00000054654:SYNE2   5950.04530060011    0.012667982845789   0.959394280265825
MSTRG.10968 ENSG00000054654:SYNE2   1.41078707122937    -4.23070771833164   0.070293522379558
MSTRG.10175 ENSG00000068650:ATP11A  1833.81055560052    1.56556217491457    8.77220310293335E-10
MSTRG.10176 ENSG00000068650:ATP11A  4.61787244133165    -2.5207751366497    0.038542082326133
MSTRG.30937 ENSG00000070614:NDST1   658.6970054024  1.41142315136311    3.01295144450564E-14
MSTRG.30938 ENSG00000070614:NDST1   1.56573096638691    0.033948989808379   0.985488588961825
MSTRG.11506 ENSG00000100865:CINP    439.00088940909 0.091933520534186   0.786005173471876
MSTRG.11508 ENSG00000100865:CINP    5.63935284665922    -2.05575795178324   0.038659140045149
MSTRG.17621 ENSG00000101558:VAPA    2059.87434998168    0.005182706679134   0.982051464465322
MSTRG.17622 ENSG00000101558:VAPA    3.36185575110659    -2.37245905503554   0.090177894161586
MSTRG.27933 ENSG00000125388:GRK4    148.13570263131 0.922617617272806   0.004572310096685
MSTRG.27935 ENSG00000125388:GRK4    20.9983000186827    -2.74804985745791   2.69737977243546E-05
MSTRG.25158 ENSG00000128254:C22orf24    1.89318586903034    2.02010767441085    0.382214666408296
MSTRG.25159 ENSG00000128254:C22orf24    10.4722908231843    -1.6212485675741    0.026702940921691
MSTRG.15380 ENSG00000141510:TP53    2011.12079411078    1.01082518533637    6.02405522650678E-06
MSTRG.15381 ENSG00000141510:TP53    40.6951300890702    0.578072468755891   0.262835426053252
MSTRG.6817  ENSG00000149256:TENM4   902.877209604941    1.41048092112734    1.2491105757664E-05
MSTRG.6819  ENSG00000149256:TENM4   0.433135937501165   -2.51416321098229   0.532866041255987
MSTRG.4173  ENSG00000150867:PIP4K2A 715.089818003131    -0.452948374471478  0.099058893543589
MSTRG.4174  ENSG00000150867:PIP4K2A 1.59372578144667    -4.40786190776114   0.093908906761754

How can we resolve these redundancy events (as these involve some very important genes)? What is the popular practice?

Thanks in advance

differential-gene-expression RNA-Seq stringtie • 1.1k views

ADD COMMENT • link updated 10 weeks ago by Ram 43k • written 3.8 years ago by SJ Basu ▴ 50

0

Entering edit mode

As a comment, if you want gene-level differential analysis then simply skip this unnecessarily-complicated pipeline and use something like salmon for quantification of reads against a transcriptome, followed by tximport. Unless you specifically look for novel transcripts you do not need stringtie. It just makes the analysis more complicated, creating issues like this. Don't let yourself be impressed by the fact that it was published in a notable journal, for standard gene-level DE analysis this pipeline is imho not recommended. salmon-tximport is what I would recommend.

ADD REPLY • link 3.6 years ago by ATpoint 81k

score 0 · Answer 1 · 2020-09-21

The problem with gene_names in StringTie data can originate from 3 different sources: 1) It is a novel transcript in a known gene 2) It is a novel transcript in a cluster of genes (multiple gene_names) which are joined together by StringTie/Cufflinks because of their overlap 3) It is a novel gene - meaning no genomic overlap with any feature in the reference you are using.

From my experience with StringTie data there are typically thens of thousands of missing gene_names and ~50% of the missing gene_names are due to problem 1 and 2. To solve this I have just release an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problem 1 and 2 for most genes. You simply use the importRdata() function - which will fix the isoform annotation which is fixable and clean up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer