Question: Calculating the gene coverage in StringTie
2
gravatar for SJ Basu
2.9 years ago by
SJ Basu30
India
SJ Basu30 wrote:

Hello People,

I ran the new tuxedo pipeline for a reference based transcriptome analysis on HumanGenome19. After running HISAT2, I executed StringTie with -A option and got the abundance file and it looks like the following

Gene ID Gene Name Reference Strand Start End Coverage FPKM TPM

ENSG00000223972 DDX11L1 1 + 11869 14412 0.261954 0.056163 0.152539

ENSG00000227232 WASH7P 1 - 14363 29806 23.156389 5.585505 15.170202

ENSG00000243485 MIR1302-10 1 + 29554 31109 0.093882 0.048981 0.133031

Now what I would like to know is how is the "Coverage" value in 7th column calculated and in what units (or is it in percentage) ??

In the manual page, it shows "cov: The average per-base coverage for the transcript or exon" but still the calculation is not clear to me.

rna-seq coverage stringtie gene • 3.1k views
ADD COMMENTlink modified 2.4 years ago by geo.pertea80 • written 2.9 years ago by SJ Basu30

did you already get the answer from somewhere else?

ADD REPLYlink written 2.4 years ago by lakhujanivijay5.0k

Actually I have tried to look for the answer elsewhere but to no success ...and I also tried to calculate and arrive at the given results by looking at reads mapped at each bases of the shown genes in IGV and still couldn't regenerate the result values !!! ...

ADD REPLYlink written 2.4 years ago by SJ Basu30
5
gravatar for geo.pertea
2.4 years ago by
geo.pertea80
geo.pertea80 wrote:

The coverage value shown in the output of StringTie (and in other genomics programs) is an average of all per-base coverages across the length of genomic segment (exon) or set of segments (transcript). For a gene (which is what is shown in the -A table), the union of all the exonic segments in the gene is considered for this calculation.

For example, an exon coverage value is calculated like this: all the read alignments intersecting the exon are considered and the coverage values for each base (i.e. the number of the read alignments covering that base) are added up and then divided by the length of that exon.

Additionally in StringTie there are a few other factors influencing that per-base coverage value (i.e. the way the number of alignments covering a base is actually "counted"):

  • potential filtering of some of the read alignments (those considered suspicious/unreliable; they may appear in IGV but discarded by StringTie)
  • weighing down the per-base coverage contribution of multi-mapped reads (i.e. if a read is mapped in n other places, we count its base coverage contribution as 1/n instead of 1, for each "covered" base)
  • for multi-transcript genes: distributing read alignments among overlapping transcripts according to the maximum flow algorithm

Note that the last factor above, although the most complex, does not influence gene coverage values, because at that level we do not need to worry about the distribution of read alignments between transcripts/exons (as mentioned above, simply considering the segment union of all exons in the gene is enough).

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by geo.pertea80

Finally have the answer from The Author !!...Though I guessed some of these facts but was in need for confirmation...These facts answer a few other queries regarding the tool too.

Thank you so much Geo.pertea.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by SJ Basu30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 876 users visited in the last hour