Question

Problem in the results of StringTie

8.1 years ago

wangyang703092 ▴ 120

Hi. When I quantify expression levels for the features of the transcriptome using StringTie, some of the results in the output confuse me. In e_data.ctab, the coverage of one transcript's exons is really high, and many reads do map to the reference genome at this site.

e_id    chr     strand  start   end     rcount  ucount  mrcount cov     cov_sd  mcov    mcov_sd
23      scaffold_1      +       87462   87575   64      64      64.00   43.7895 11.0743 43.7807 11.0660
24      scaffold_1      +       87674   88356   253     249     251.00  47.9561 12.6755 47.5183 12.7022

But in t_data.ctab and the GTF file, the results are different:

    scaffold_1      StringTie       transcript      87462   88356   1000    +       .       gene_id "estExt_fgenesh1_pg.C_10011"; transcript_id "127031"; cov "0.920524"; FPKM "3.685930"; TPM "15.617980";
scaffold_1      StringTie       exon    87462   87575   1000    +       .       gene_id "estExt_fgenesh1_pg.C_10011"; transcript_id "127031"; exon_number "1"; cov "0.000000";
scaffold_1      StringTie       exon    87674   88356   1000    +       .       gene_id "estExt_fgenesh1_pg.C_10011"; transcript_id "127031"; exon_number "2"; cov "1.074169";

The coverage and FPKM are very low, and I don't know why. Does anyone have any ideas?

RNA-Seq StringTie • 4.4k views

updated 7.4 years ago by jonasmst ▴ 410 • written 8.1 years ago by wangyang703092 ▴ 120

score 1 · Answer 1 · 2016-12-19

Not sure if this is still of interest to you, but here's my take on your question for future reference:

The coverage values for the exons in your e_data.ctab are based on the observed reads covering each exon's region. However, several exons may overlap in that region (see image below). This gives the values in e_data.ctab a central limitation: every read (gray boxes) is counted once for each exon it aligns to. In the figure, exon 1A gets a coverage value of 8 and exon 1B gets a coverage value of 10, so 8+10=18 read counts go into the coverage calculations even though only 10 reads were actually observed.

[Figure: read alignment]
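The double counting described above can be reproduced with a toy example. This is only a sketch: the exon and read coordinates are hypothetical, chosen so that the per-exon counts match the 8 and 10 from the figure, and it counts overlapping reads rather than computing StringTie's per-base coverage.

```python
# Hypothetical coordinates, chosen only to reproduce the counts in the answer.
# Exons 1A and 1B overlap; 1B extends further downstream than 1A.
exons = {"1A": (100, 200), "1B": (100, 300)}

# 8 reads fall in the shared region, 2 reads fall in the part unique to 1B.
reads = [(110, 160)] * 8 + [(220, 270)] * 2


def overlaps(a, b):
    """Return True if closed intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


# Naive per-exon counting: a read is counted once for EVERY exon it overlaps.
per_exon = {
    name: sum(overlaps(read, span) for read in reads)
    for name, span in exons.items()
}
total_assignments = sum(per_exon.values())

print("per-exon counts:  ", per_exon)
print("total assignments:", total_assignments)
print("reads observed:   ", len(reads))
```

The total number of assignments exceeds the number of reads because the reads in the shared region are counted for both exons.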

The values in your GTF file are the product of the maximum-flow algorithm applied by StringTie, in which each of the 10 reads is used only once and distributed among the exons (as coverage metrics) based on which transcripts StringTie believes to be expressed. Looking at the figure, we can tell that transcript T1 is likely not expressed, because there are reads covering a region in which T1 has no exons. So T1 (and the exons within it) would get lower coverage values, and T2 (and its exons) would get higher coverage.

By analogy with your data, I'd bet you're looking at a T1 kind of situation, and that there is another transcript in your GTF file with higher coverage (your T2 transcript).
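One way to check this guess is to list every transcript in the GTF that overlaps the region in question and sort them by coverage. The sketch below assumes a tab-separated GTF with `key "value";` attributes, as in the StringTie output above. The first record is the transcript line from the question; the second is a hypothetical overlapping transcript, invented purely to show what a "T2" hit would look like.

```python
import re

# Record 1: the transcript line from the question.
# Record 2: HYPOTHETICAL overlapping transcript, invented for illustration only.
GTF_LINES = [
    'scaffold_1\tStringTie\ttranscript\t87462\t88356\t1000\t+\t.\t'
    'gene_id "estExt_fgenesh1_pg.C_10011"; transcript_id "127031"; '
    'cov "0.920524"; FPKM "3.685930"; TPM "15.617980";',
    'scaffold_1\tStringTie\ttranscript\t87400\t88400\t1000\t+\t.\t'
    'gene_id "hypothetical_gene"; transcript_id "hypothetical_T2"; '
    'cov "45.0"; FPKM "180.0"; TPM "760.0";',
]

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def overlapping_transcripts(lines, chrom, start, end):
    """Return (transcript_id, strand, cov) for transcripts overlapping
    chrom:start-end, sorted by coverage from highest to lowest."""
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # keep only well-formed transcript records
        t_start, t_end = int(fields[3]), int(fields[4])
        if fields[0] == chrom and t_start <= end and start <= t_end:
            attrs = dict(ATTR_RE.findall(fields[8]))
            hits.append((attrs["transcript_id"], fields[6], float(attrs["cov"])))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


hits = overlapping_transcripts(GTF_LINES, "scaffold_1", 87462, 88356)
for transcript_id, strand, cov in hits:
    print(transcript_id, strand, cov)
```

With a real file, an open file handle can be passed in place of `GTF_LINES`. If a transcript with much higher coverage shows up at the top of the list, that is probably where the reads were assigned.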

TLDR: You're likely looking at coverage values not corrected for transcript expression (e_data.ctab) versus corrected coverage values for a non-expressed transcript (GTF-file).