Question: How does StringTie calculate transcript coverage?
2.4 years ago by
United States
ddzhangzz wrote:

Recently I used StringTie to quantify RNA-seq reads mapped to transcripts. Two transcripts of the same gene have exactly the same length and number of exons (and the same exon structure), yet their reported coverage values are very different.

t_id   chr    strand  start    end      t_name             num_exons  length  gene_id             gene_name  cov        FPKM
77237  chr17  -       7668402  7687538  ENST00000269305.7  11         2579    ENSG00000141510.14  TP53       31.946598  5.549151
77238  chr17  -       7668402  7687538  ENST00000620739.3  11         2579    ENSG00000141510.14  TP53       2.961419   0.514401

I am wondering how StringTie calculates the coverage. By its definition, and if my understanding is correct, the coverage is calculated as \sum_{i=1}^{m} (seq_i \times mapped\_seq\_length_i) / transcript\_length. If this is true, I would expect the coverage of these two transcripts to be similar, so why are they so different?
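To make the formula above concrete, here is a minimal sketch of that naive calculation. It assumes each aligned read contributes its mapped length once and that every read is assigned to exactly one transcript; the function name and inputs are hypothetical, and this is not how StringTie itself computes the value.

```python
def naive_coverage(mapped_lengths, transcript_length):
    """Naive per-base coverage: total aligned bases / transcript length.

    mapped_lengths: aligned length (in bases) of each read assigned
    to the transcript (hypothetical input, one entry per read).
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    return sum(mapped_lengths) / transcript_length


# Three reads covering 250 bases in total over a 500-base transcript.
print(naive_coverage([100, 100, 50], 500))
```

Under this formula, two transcripts with identical exons and identical reads would always get the same value, which is why the table above looks surprising.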

rna-seq • 1.9k views

Did you find the solution anywhere else? We are struggling with the same question, and it is not clearly explained anywhere.

written 2.3 years ago by lakhujanivijay

You may want to follow up on this post on GitHub. Maybe someone is listening there.

written 2.3 years ago by lakhujanivijay
2.3 years ago by
geo.pertea wrote:

Please see this answer about how coverage values are calculated by StringTie. Transcript and exon coverage values for overlapping transcripts (alternate isoforms) are calculated after distributing the read alignments according to the maximum flow algorithm -- it's not as simple as applying a formula.

For this particular question, without further data I presume that ENST00000269305.7 and ENST00000620739.3 are somehow distinct isoforms (so not exactly identical in their intron-exon structure, otherwise one of them would be discarded when the input file is loaded).


ENST00000269305.7 and ENST00000620739.3 are truly identical in their exon structure, even though they are assigned different Ensembl IDs (probably because they have different amino acid sequences). Such cases do not seem to be rare: we found at least 5 duplicated transcripts in one gene. My question was how StringTie treats them. Salmon, by comparison, removed one of these duplicated isoforms, but I would still like to know the details for StringTie.
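One way to check for such duplicates before quantification is to group transcripts by their exon coordinates. The sketch below is only an illustration, assuming a standard tab-separated GTF with `exon` features and a `transcript_id` attribute; the function names are hypothetical and it says nothing about what StringTie does internally.

```python
from collections import defaultdict


def exon_chains(gtf_lines):
    """Map transcript_id -> (chrom, strand, sorted tuple of exon (start, end))."""
    exons = defaultdict(list)
    location = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        transcript_id = None
        # Attributes look like: gene_id "G1"; transcript_id "T1";
        for attr in fields[8].split(";"):
            key, _, value = attr.strip().partition(" ")
            if key == "transcript_id":
                transcript_id = value.strip().strip('"')
                break
        if transcript_id is None:
            continue
        exons[transcript_id].append((int(fields[3]), int(fields[4])))
        location[transcript_id] = (fields[0], fields[6])
    return {t: location[t] + (tuple(sorted(e)),) for t, e in exons.items()}


def duplicate_groups(gtf_lines):
    """Return lists of transcript IDs that share an identical exon chain."""
    groups = defaultdict(list)
    for transcript_id, chain in exon_chains(gtf_lines).items():
        groups[chain].append(transcript_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Calling `duplicate_groups(open("annotation.gtf"))` (the file name is a placeholder) would list each set of transcripts with the same chromosome, strand, and exon coordinates.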

modified 2.3 years ago • written 2.3 years ago by ddzhangzz