The coverage value shown in the output of StringTie (and in other genomics programs) is the average of the per-base coverages across the length of a genomic segment (exon) or a set of segments (transcript). For a gene (which is what the -A table reports), the union of all the exonic segments in the gene is used for this calculation.
For example, an exon coverage value is calculated like this: all the read alignments intersecting the exon are considered and the coverage values for each base (i.e. the number of the read alignments covering that base) are added up and then divided by the length of that exon.
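As a minimal sketch of the calculation just described (not StringTie's actual code), the snippet below counts, for each base of an exon, how many alignments cover it and then averages over the exon length. The function name, the 1-based inclusive coordinates, and the treatment of each alignment as a single ungapped interval are assumptions made for illustration.

```python
def mean_exon_coverage(exon_start, exon_end, alignments):
    """Average per-base coverage of one exon.

    Coordinates are assumed 1-based and inclusive; each alignment is a
    (start, end) pair treated as one ungapped interval.
    """
    exon_len = exon_end - exon_start + 1
    per_base = [0] * exon_len  # number of alignments covering each base
    for aln_start, aln_end in alignments:
        first = max(exon_start, aln_start)
        last = min(exon_end, aln_end)
        for pos in range(first, last + 1):  # empty range if no overlap
            per_base[pos - exon_start] += 1
    return sum(per_base) / exon_len


# Exon of length 100; two alignments overlap it by 40 and 50 bases,
# the third does not intersect the exon at all.
example_cov = mean_exon_coverage(101, 200, [(91, 140), (151, 250), (300, 350)])
print(example_cov)  # (40 + 50) / 100
```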
Additionally, in StringTie a few other factors influence that per-base coverage value (i.e. the way the number of alignments covering a base is actually "counted"):
- potential filtering of some of the read alignments (those considered suspicious/unreliable; they may appear in IGV but are discarded by StringTie)
- down-weighting the per-base coverage contribution of multi-mapped reads (i.e. if a read is mapped to n places in total, we count its base coverage contribution as 1/n instead of 1, for each "covered" base)
- for multi-transcript genes: distributing read alignments among overlapping transcripts according to the maximum flow algorithm
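The multi-mapping adjustment from the list above can be illustrated with a small sketch. This is only a simplified model, not StringTie's implementation: it ignores alignment filtering and the maximum flow step, and the `num_hits` field (the number of places a read maps to, such as an aligner might report in the SAM `NH` tag) is an assumed input.

```python
def weighted_exon_coverage(exon_start, exon_end, alignments):
    """Average per-base coverage where each alignment contributes 1/num_hits.

    Each alignment is a hypothetical (start, end, num_hits) tuple with
    1-based inclusive coordinates; num_hits is the total number of
    locations the read maps to (1 for a uniquely mapped read).
    """
    exon_len = exon_end - exon_start + 1
    total = 0.0
    for aln_start, aln_end, num_hits in alignments:
        overlap = min(exon_end, aln_end) - max(exon_start, aln_start) + 1
        if overlap > 0:
            # every covered base gets 1/num_hits instead of 1
            total += overlap / num_hits
    return total / exon_len


# A unique read covering all 10 bases plus a read mapped to 2 places
# covering 5 bases: (10 * 1 + 5 * 0.5) / 10
weighted_cov = weighted_exon_coverage(1, 10, [(1, 10, 1), (1, 5, 2)])
print(weighted_cov)
```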
Note that the last factor above, although the most complex, does not influence gene coverage values, because at that level we do not need to worry about the distribution of read alignments between transcripts/exons (as mentioned above, simply considering the segment union of all exons in the gene is enough).
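To make the gene-level case concrete, here is a rough sketch under the same simplifying assumptions (1-based inclusive coordinates, ungapped alignments, no filtering or multi-mapping weights). The exons of all transcripts are merged into their union, and coverage is averaged over the total length of that union. The function names are hypothetical.

```python
def merge_exons(exons):
    """Union of (start, end) exon intervals, merging overlapping or adjacent ones."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def gene_coverage(exons, alignments):
    """Average per-base coverage over the union of all exons of a gene."""
    segments = merge_exons(exons)
    union_len = sum(end - start + 1 for start, end in segments)
    covered = 0
    for aln_start, aln_end in alignments:
        for seg_start, seg_end in segments:
            overlap = min(seg_end, aln_end) - max(seg_start, aln_start) + 1
            if overlap > 0:
                covered += overlap
    return covered / union_len


# Exons from two overlapping transcripts collapse to (1, 20) and (31, 40).
gene_exons = [(1, 10), (5, 20), (31, 40)]
print(merge_exons(gene_exons))
print(gene_coverage(gene_exons, [(1, 20), (35, 44)]))  # (20 + 6) / 30
```

Because only the exon union matters here, how the reads are split between the individual transcripts never enters the calculation.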
2.4 years ago by geo.pertea