Question

StringTie - A transcript matching to multiple genes


7.6 years ago

manuelmendoza ▴ 50

Hello!

I am trying to analyse differential gene expression in 83 samples split into three groups, following the protocol described by Pertea et al., 2016 (doi:10.1038/nprot.2016.095).

I now have two CSV files for each pairwise comparison, one with the transcript results and the other with the gene results. I have added a column with the gene name to the gene results file following Freeze's instructions (https://www.biostars.org/p/218136/).

A lot of rows have a dot "." as the gene name. I suppose this means "unknown gene". Is that correct? But when I look up these genes in the transcript results file, I find that transcripts sharing the same gene ID have several different gene names. What is happening?

A technical question: what does StringTie do when a transcript matches multiple exons/genes? How does StringTie count it?

Gene results:

    geneNames feature          id        fc         pval       qval       exp_sig
1385            .    gene MSTRG.18605 0.3612696 8.836918e-05 0.02780747 Downregulated
2009            .    gene  MSTRG.2251 0.3705158 1.723619e-04 0.03990178 Downregulated
3565            .    gene MSTRG.31880 0.2855206 2.766333e-05 0.02210375 Downregulated
3577            .    gene  MSTRG.3192 0.4190300 2.500196e-04 0.04590761 Downregulated
7730            .    gene MSTRG.52616 0.4974403 1.902062e-04 0.04187998 Downregulated
8791 LOC102724999    gene MSTRG.57635 0.4391518 7.886925e-05 0.02780747 Downregulated
8833         RRM2    gene  MSTRG.5791 0.1491026 9.653982e-06 0.01517076 Downregulated
8941            .    gene MSTRG.58419 0.4839879 2.913421e-04 0.04883999 Downregulated
9248            .    gene  MSTRG.7286 0.4853837 4.294071e-05 0.02455956 Downregulated

Rows for MSTRG.18605 extracted from the transcript results file:

          geneNames     geneIDs    feature    id        fc        pval      qval
12632         . MSTRG.18605 transcript 92336 0.8248940 0.348378059 0.8105198
12634         . MSTRG.18605 transcript 92338 0.5838309 0.180876736 0.7302217
12635         . MSTRG.18605 transcript 92339 0.4058340 0.051912182 0.5780383
12636         . MSTRG.18605 transcript 92340 0.3070723 0.002123351 0.2270620
12637         . MSTRG.18605 transcript 92341 0.8145933 0.510077645 0.8781143
12638     HLA-C MSTRG.18605 transcript 92342 0.1351463 0.002860136 0.2504040
12639         . MSTRG.18605 transcript 92343 1.0737243 0.810680044 0.9544281
12640         . MSTRG.18605 transcript 92344 0.4356676 0.012957412 0.3988734
12641         . MSTRG.18605 transcript 92345 0.5661032 0.014102596 0.4047605
12642         . MSTRG.18605 transcript 92346 0.4012696 0.048469168 0.5719337
12643         . MSTRG.18605 transcript 92347 0.2884399 0.004146397 0.2847510
12644         . MSTRG.18605 transcript 92348 0.3167699 0.033571700 0.5162445
12645         . MSTRG.18605 transcript 92349 0.3484434 0.017986318 0.4343761
12650         . MSTRG.18605 transcript 92355 1.1018119 0.684596561 0.9242470
12656         . MSTRG.18605 transcript 92362 1.2783697 0.407525370 0.8302648
12657         . MSTRG.18605 transcript 92363 0.6356763 0.383143717 0.8199931
12658         . MSTRG.18605 transcript 92364 0.9109425 0.752058633 0.9371627
12659         . MSTRG.18605 transcript 92365 0.7067905 0.098177182 0.6499381
12661         . MSTRG.18605 transcript 92367 0.7979155 0.371616108 0.8166484
12663         . MSTRG.18605 transcript 92372 0.8521393 0.347975902 0.8105198
12664         . MSTRG.18605 transcript 92373 1.1114767 0.630943495 0.9137496
12665         . MSTRG.18605 transcript 92374 1.1819176 0.227935366 0.7639951
12667         . MSTRG.18605 transcript 92376 0.8134280 0.242299301 0.7643093
12668         . MSTRG.18605 transcript 92377 0.8685366 0.616908528 0.9100701
12669     HLA-B MSTRG.18605 transcript 92378 0.4077784 0.080955383 0.6381334
12670         . MSTRG.18605 transcript 92380 1.0450625 0.892268294 0.9766652
12671     HLA-B MSTRG.18605 transcript 92381 0.8218658 0.667177666 0.9193687
12672     HLA-B MSTRG.18605 transcript 92382 0.7802102 0.441900921 0.8454283
12673         . MSTRG.18605 transcript 92383 1.1639513 0.778891293 0.9450891
12675         . MSTRG.18605 transcript 92386 0.5742808 0.170972054 0.7188115
12676   MIR6891 MSTRG.18605 transcript 92387 0.7063845 0.054794258 0.5844075
12677   MIR6891 MSTRG.18605 transcript 92389 1.3352687 0.217143588 0.7592203

RNA-Seq rna-seq R alignment Assembly • 2.9k views

updated 4.8 years ago by Kristoffer Vitting-Seerup ★ 4.2k • written 7.6 years ago by manuelmendoza ▴ 50


Did you check the coordinates of these genes/transcripts?

7.6 years ago by cpad0112 21k


Hey, did you find an answer to your question?

5.7 years ago by c_u ▴ 530

score 0 · Answer 1 · 2020-09-21

The missing gene_names from StringTie can originate from 3 different sources:

1) It is a novel transcript in a known gene.
2) It is a novel transcript in a cluster of genes (multiple gene_names) which are joined together by StringTie/Cufflinks because of their overlap.
3) It is a novel gene, meaning no genomic overlap with any feature in the reference you are using.
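To make the three cases concrete, here is a minimal standalone Python sketch that groups transcript rows by gene ID and counts the distinct reference gene names in each group. It is only an approximation under stated assumptions: the function name `classify_genes`, the labels, and the gene IDs `MSTRG.X1`/`MSTRG.X2` with the name `GENE_A` are hypothetical, and a group with no reference name is treated as a candidate for case 3 without checking genomic overlap. The `MSTRG.18605` rows mirror the transcript table in the question.

```python
from collections import defaultdict

# (gene_name, gene_id) pairs. The MSTRG.18605 names come from the question's
# transcript table; MSTRG.X1, MSTRG.X2 and GENE_A are hypothetical.
transcripts = [
    (".", "MSTRG.18605"),
    ("HLA-C", "MSTRG.18605"),
    ("HLA-B", "MSTRG.18605"),
    ("MIR6891", "MSTRG.18605"),
    (".", "MSTRG.X1"),
    ("GENE_A", "MSTRG.X1"),
    (".", "MSTRG.X2"),
]


def classify_genes(rows):
    """Label each gene ID by how many distinct reference names its transcripts carry."""
    names = defaultdict(set)
    for gene_name, gene_id in rows:
        names[gene_id]  # touch the key so genes with only "." still get an entry
        if gene_name != ".":
            names[gene_id].add(gene_name)

    labels = {}
    for gene_id, known in names.items():
        if len(known) == 0:
            labels[gene_id] = "no reference name (candidate novel gene)"
        elif len(known) == 1:
            labels[gene_id] = "single known gene with novel transcripts"
        else:
            labels[gene_id] = "cluster of several known genes"
    return labels


labels = classify_genes(transcripts)
for gene_id, label in labels.items():
    print(gene_id, "->", label)
```

In this sketch MSTRG.18605 falls into the cluster category because its transcripts carry HLA-C, HLA-B and MIR6891, which is consistent with source 2 above.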

In my experience with StringTie data there are typically tens of thousands of missing gene_names, and ~50% of them are due to problems 1 and 2. To solve this I have just released an update to the R package IsoformSwitchAnalyzeR (available in versions >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which fixes the isoform annotation where that is possible and cleans up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.
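As a rough illustration of what fixing problem 1 means, the standalone Python sketch below fills in "." for transcripts whose gene ID has exactly one known gene name and leaves ambiguous clusters untouched. This is not IsoformSwitchAnalyzeR's actual algorithm, only a simplified picture of the idea; the function name `fill_gene_names`, the gene ID `MSTRG.X1` and the name `GENE_A` are hypothetical.

```python
from collections import defaultdict


def fill_gene_names(rows):
    """Replace '.' with the gene name when the gene ID has exactly one known name."""
    known = defaultdict(set)
    for gene_name, gene_id in rows:
        if gene_name != ".":
            known[gene_id].add(gene_name)

    fixed = []
    for gene_name, gene_id in rows:
        candidates = known.get(gene_id, set())
        if gene_name == "." and len(candidates) == 1:
            # Unambiguous: every named transcript of this gene ID agrees.
            gene_name = next(iter(candidates))
        fixed.append((gene_name, gene_id))
    return fixed


# Hypothetical single-gene case plus the ambiguous cluster from the question.
rows = [
    (".", "MSTRG.X1"),
    ("GENE_A", "MSTRG.X1"),
    (".", "MSTRG.18605"),
    ("HLA-C", "MSTRG.18605"),
    ("HLA-B", "MSTRG.18605"),
]

fixed_rows = fill_gene_names(rows)
for gene_name, gene_id in fixed_rows:
    print(gene_name, gene_id)
```

The unnamed MSTRG.18605 transcript stays "." here because two candidate names exist; resolving such clusters (problem 2) needs coordinate information that this sketch does not use.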

Hope this helps.

Cheers

Kristoffer