Question

How Should I Compute Coverage When Some Reads Map To Multiple Locations?

5

Entering edit mode

14.3 years ago

Ryan Thompson ★ 3.7k

I have a bam file in which some reads are mapped uniquely and some are mapped to multiple genomic positions (because of gene duplications or pseudogenes or whatever). I want to be able to calculate the coverage for any base in the genome. I know how to calculate coverage in general, but I have two questions regarding the treatment of reads that map to multiple locations.

If a read maps to N locations, should I count it at each location with a depth of 1/N? In other words, should I "split" the coverage equally across the N locations? Should I use mapping quality or some other measure to do a weighted split? Or should I just count it as a depth of 1 at all locations where it maps?
When reading from a bam file, how can I easily tell how many positions a read has been mapped to? Is there a field in the bam file that gives me this information? Is the read simply listed multiple times? Can I easily preprocess the bam file to add a "map count" in a custom field?

Edit

I want to calculate coverage mostly for QC-stats. I am doing exome sequencing and I want to be able to answer questions such as "What fraction of exonic nucleotides were covered at at least 5x?"

sam bam coverage • 6.7k views

ADD COMMENT • link updated 14.2 years ago by Chris Miller 22k • written 14.3 years ago by Ryan Thompson ★ 3.7k

0

Entering edit mode

FWIW, I think counting depth as 1/N makes the most sense. Otherwise, reads from repetitive regions get more weight then non-repetitive ones - i.e. the sum contribution from each read should be 1.

ADD REPLY • link 14.3 years ago by Ketil 4.2k

0

Entering edit mode

For point 1, it may be worth reading Alkan et al, 2009. http://www.nature.com/ng/journal/v41/n10/abs/ng.437.html

ADD REPLY • link 14.2 years ago by Aaronquinlan 12k

score 5 · Answer 1 · 2011-03-30

5

Entering edit mode

14.3 years ago

lh3 33k

This largely depends on your downstream analyses. For RNA-seq, for example, some methods distribute repetitive reads based on the depth of unique reads nearby. Perhaps others will give you more precise answers.
No, you cannot easily get this information unless the mapper in use keeps in optional SAM tags. If you really want to get this, sort your BAM by read names and then do a count.

ADD COMMENT • link 14.3 years ago by lh3 33k

0

Entering edit mode

+1 for mentioning the distributive approach

ADD REPLY • link 14.2 years ago by Chris Evelo 10k

0

Entering edit mode

+1 also for the weighted distribution of reads. You map unique reads to your entities (gene, exon, intergenic island,...) and then assign your non-unique reads regarding the coverage of the entities your read mapped to. If a non unique read map to a location with a coverage of 9 and another with a coverage of 1 then 90% of this read goes to the 1st location and 10% to the second. That's a common approach with RNA-seq as described even though there is still some alternatives.

ADD REPLY • link 14.2 years ago by Philippe ★ 1.9k

score 3 · Answer 2 · 2011-03-30

There are two ways of dealing with this issue:

1) Use only uniquely mapping reads.

Advantage: solves your issue neatly
Disadvantage: You lose some information. Also, if you're doing binned copy number calculations and don't have a matched normal, you'll then have to correct for mapability.

2) Count each read's contribution to the depth as 1/N.

Advantage: Allows you to count reads in repetitive regions
Disadvantage: You may end up showing some coverage in regions that are not actually present in the data set. Imagine a repeat exists 10 times in the genome, but one copy is in a homozygously deleted region. If you have a thousand reads that map to any other copy of that repeat, you'll show that each repeat has coverage, including the one in the deleted region.

In response to your second question, your choice of aligner and input parameters will determine how repetitive reads are marked, if at all. Some aligners just assign ambiguously mapping reads to one of the positions randomly, but note this in the quality score.