Question

Calculating the gene coverage in StringTie

2

Entering edit mode

8.0 years ago

SJ Basu ▴ 60

Hello People,

I ran the new tuxedo pipeline for a reference based transcriptome analysis on HumanGenome19. After running HISAT2, I executed StringTie with -A option and got the abundance file and it looks like the following

Gene ID Gene Name Reference Strand Start End Coverage FPKM TPM

ENSG00000223972 DDX11L1 1 + 11869 14412 0.261954 0.056163 0.152539

ENSG00000227232 WASH7P 1 - 14363 29806 23.156389 5.585505 15.170202

ENSG00000243485 MIR1302-10 1 + 29554 31109 0.093882 0.048981 0.133031

Now what I would like to know is how is the "Coverage" value in 7th column calculated and in what units (or is it in percentage) ??

In the manual page, it shows "cov: The average per-base coverage for the transcript or exon" but still the calculation is not clear to me.

RNA-Seq StringTie Coverage Gene • 8.2k views

ADD COMMENT • link updated 7.5 years ago by geo.pertea ▴ 130 • written 8.0 years ago by SJ Basu ▴ 60

0

Entering edit mode

did you already get the answer from somewhere else?

ADD REPLY • link 7.5 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Actually I have tried to look for the answer elsewhere but to no success ...and I also tried to calculate and arrive at the given results by looking at reads mapped at each bases of the shown genes in IGV and still couldn't regenerate the result values !!! ...

ADD REPLY • link 7.5 years ago by SJ Basu ▴ 60

score 9 · Accepted Answer · 2018-01-27

The coverage value shown in the output of StringTie (and in other genomics programs) is an average of all per-base coverages across the length of genomic segment (exon) or set of segments (transcript). For a gene (which is what is shown in the -A table), the union of all the exonic segments in the gene is considered for this calculation.

For example, an exon coverage value is calculated like this: all the read alignments intersecting the exon are considered and the coverage values for each base (i.e. the number of the read alignments covering that base) are added up and then divided by the length of that exon.

Additionally in StringTie there are a few other factors influencing that per-base coverage value (i.e. the way the number of alignments covering a base is actually "counted"):

potential filtering of some of the read alignments (those considered suspicious/unreliable; they may appear in IGV but discarded by StringTie)
weighing down the per-base coverage contribution of multi-mapped reads (i.e. if a read is mapped in n other places, we count its base coverage contribution as 1/n instead of 1, for each "covered" base)
for multi-transcript genes: distributing read alignments among overlapping transcripts according to the maximum flow algorithm

Note that the last factor above, although the most complex, does not influence gene coverage values, because at that level we do not need to worry about the distribution of read alignments between transcripts/exons (as mentioned above, simply considering the segment union of all exons in the gene is enough).