Question

Understanding the key concept of per-gene coverage in the context of WGS

7.3 years ago

lakhujanivijay 5.8k

I may be asking a very basic question, but it has been quite a discussion topic for me!

What should we understand when we say "per-gene coverage"? I am asking in the context of whole-genome sequencing.

1. For each gene in question, how many reads aligned completely or partially to it? This could be calculated with simple tools like HTSeq (written in Python). Obviously, it is possible that 10 out of 10 reads map to one particular region of the gene, so this is not coverage in the true sense.
2. For each gene in question, count the number of N's (gaps) and then calculate coverage as ( ( gene_length - no. of gaps ) / gene_length ) * 100. A small Perl/Python script or an awk one-liner would be enough. This gives the percentage of the gene in question that was covered by at least one read.

Or am I entirely misunderstanding the key concept of per-gene coverage (i.e. it is neither 1 nor 2)?

Any insights and ways to calculate the same will be really helpful.
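
To make the difference between the two interpretations concrete, here is a minimal Python sketch. It assumes reads are already available as 0-based, half-open `(start, end)` intervals on the same chromosome as the gene; the function name `read_count_and_breadth` and the toy coordinates are hypothetical and not taken from HTSeq or any other tool.

```python
def read_count_and_breadth(gene_start, gene_end, reads):
    """Return (overlapping read count, percent of gene bases covered).

    gene_start/gene_end and each read are 0-based, half-open intervals
    (an assumption of this sketch, not a requirement of any tool).
    """
    gene_length = gene_end - gene_start
    if gene_length <= 0:
        raise ValueError("gene must have positive length")

    covered = [False] * gene_length  # one flag per base of the gene
    n_reads = 0
    for start, end in reads:
        lo = max(start, gene_start)
        hi = min(end, gene_end)
        if lo < hi:  # read overlaps the gene completely or partially
            n_reads += 1
            for pos in range(lo, hi):
                covered[pos - gene_start] = True

    # Interpretation 2: (gene_length - uncovered bases) / gene_length * 100
    breadth = 100.0 * sum(covered) / gene_length
    return n_reads, breadth


# Ten reads that all pile up on the first 20 bases of a 100-base gene.
reads = [(100, 120)] * 10
print(read_count_and_breadth(100, 200, reads))  # (10, 20.0)
```

The read count (interpretation 1) is 10, yet only 20% of the gene is covered by at least one read (interpretation 2), which is exactly the caveat raised in point 1.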

wgs per gene coverage htseq • 1.8k views

updated 7.3 years ago by Brian Bushnell 20k • written 7.3 years ago by lakhujanivijay 5.8k

score 0 · Answer 1 · 2016-12-28

There might not be a useful universal definition. I can think of many that would be situationally useful, though. Particularly for RNA-seq, it might be interesting to measure the highest depth of any exon in a gene, and consider that the gene's depth. If all isoforms share a certain subset of exons, then the average coverage across those exons might be used as the depth. Otherwise, one could simply average the coverage across all exons and call that the depth. Or, "the gene is covered by 15,000 reads": that sounds like a useful statement, and it is not affected by differential splicing, which is convenient. Usually I think of coverage as equivalent to depth, though.
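
As a rough illustration of two of these definitions, the sketch below summarizes a gene from a per-base depth array. It assumes depth has already been computed per position, that exons are 0-based half-open index ranges into that array, and that "highest depth of any exon" means the highest per-exon mean; `gene_depth_summaries` is a hypothetical name used only for this example.

```python
def gene_depth_summaries(depths, exons):
    """Summarize gene depth from per-base depths and exon intervals.

    depths: list of per-base depths along the gene region.
    exons:  list of (start, end) half-open indices into `depths`,
            assumed non-overlapping.
    """
    # Mean depth of each exon separately.
    per_exon_mean = [sum(depths[s:e]) / (e - s) for s, e in exons]

    # Length-weighted mean over all exonic bases.
    exonic_bases = sum(sum(depths[s:e]) for s, e in exons)
    exonic_length = sum(e - s for s, e in exons)

    return {
        "max_exon_depth": max(per_exon_mean),
        "mean_exonic_depth": exonic_bases / exonic_length,
    }


# Two 10-base exons separated by a 5-base intron with no coverage.
depths = [10] * 10 + [0] * 5 + [30] * 10
exons = [(0, 10), (15, 25)]
print(gene_depth_summaries(depths, exons))
# {'max_exon_depth': 30.0, 'mean_exonic_depth': 20.0}
```

Restricting `exons` to the subset shared by all isoforms would give the "shared exons" variant with the same function.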

For whole-genome DNA sequencing, it's less clear to me where "per gene coverage" is relevant, but I'd probably calculate it by counting the number of read bases that align to exonic bases (counting only match/mismatch/noref, not indels) and dividing by the sum of the length of said exons.
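
A minimal sketch of that calculation in plain Python follows. It assumes alignments are given as `(ref_start, cigar)` pairs with 0-based start positions and standard SAM CIGAR strings, that exons are non-overlapping half-open intervals, and that only `M`, `=` and `X` operations count as aligned bases; the function names are hypothetical, and a real pipeline would read these values from a BAM file rather than a hand-written list.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(ref_start, cigar):
    """Yield (start, end) reference intervals for match/mismatch ops only."""
    pos = ref_start
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":      # aligned bases: counted
            yield pos, pos + n
            pos += n
        elif op in "DN":     # deletion / skipped region: advances reference, not counted
            pos += n
        # I, S, H, P do not consume reference positions and are not counted


def mean_exonic_depth(exons, alignments):
    """Aligned read bases falling on exons divided by total exon length."""
    exon_length = sum(end - start for start, end in exons)
    if exon_length <= 0:
        raise ValueError("exons must have positive total length")

    bases = 0
    for ref_start, cigar in alignments:
        for b_start, b_end in aligned_blocks(ref_start, cigar):
            for e_start, e_end in exons:
                bases += max(0, min(b_end, e_end) - max(b_start, e_start))
    return bases / exon_length


exons = [(100, 150), (200, 250)]   # 100 exonic bases in total
alignments = [
    (90, "50M"),        # 40 bases on exon 1
    (120, "20M2D28M"),  # 20 + 8 bases on exon 1; the deletion is not counted
    (195, "10M1I39M"),  # 5 + 39 bases on exon 2; the insertion is not counted
]
print(mean_exonic_depth(exons, alignments))  # 1.12
```

Here 112 aligned bases fall on 100 exonic bases, giving a mean depth of 1.12x for the gene.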