How Should I Compute Coverage When Some Reads Map To Multiple Locations?
2
5
Entering edit mode
13.1 years ago
Ryan Thompson ★ 3.6k

I have a bam file in which some reads are mapped uniquely and some are mapped to multiple genomic positions (because of gene duplications or pseudogenes or whatever). I want to be able to calculate the coverage for any base in the genome. I know how to calculate coverage in general, but I have two questions regarding the treatment of reads that map to multiple locations.

  1. If a read maps to N locations, should I count it at each location with a depth of 1/N? In other words, should I "split" the coverage equally across the N locations? Should I use mapping quality or some other measure to do a weighted split? Or should I just count it as a depth of 1 at all locations where it maps?

  2. When reading from a bam file, how can I easily tell how many positions a read has been mapped to? Is there a field in the bam file that gives me this information? Is the read simply listed multiple times? Can I easily preprocess the bam file to add a "map count" in a custom field?

Edit

I want to calculate coverage mostly for QC-stats. I am doing exome sequencing and I want to be able to answer questions such as "What fraction of exonic nucleotides were covered at at least 5x?"

sam bam coverage • 5.9k views
ADD COMMENT
0
Entering edit mode

FWIW, I think counting depth as 1/N makes the most sense. Otherwise, reads from repetitive regions get more weight then non-repetitive ones - i.e. the sum contribution from each read should be 1.

ADD REPLY
0
Entering edit mode

For point 1, it may be worth reading Alkan et al, 2009. http://www.nature.com/ng/journal/v41/n10/abs/ng.437.html

ADD REPLY
5
Entering edit mode
13.1 years ago
lh3 33k
  1. This largely depends on your downstream analyses. For RNA-seq, for example, some methods distribute repetitive reads based on the depth of unique reads nearby. Perhaps others will give you more precise answers.

  2. No, you cannot easily get this information unless the mapper in use keeps in optional SAM tags. If you really want to get this, sort your BAM by read names and then do a count.

ADD COMMENT
0
Entering edit mode

+1 for mentioning the distributive approach

ADD REPLY
0
Entering edit mode

+1 also for the weighted distribution of reads. You map unique reads to your entities (gene, exon, intergenic island,...) and then assign your non-unique reads regarding the coverage of the entities your read mapped to. If a non unique read map to a location with a coverage of 9 and another with a coverage of 1 then 90% of this read goes to the 1st location and 10% to the second. That's a common approach with RNA-seq as described even though there is still some alternatives.

ADD REPLY
3
Entering edit mode
13.1 years ago

There are two ways of dealing with this issue:

1) Use only uniquely mapping reads.

  • Advantage: solves your issue neatly
  • Disadvantage: You lose some information. Also, if you're doing binned copy number calculations and don't have a matched normal, you'll then have to correct for mapability.

2) Count each read's contribution to the depth as 1/N.

  • Advantage: Allows you to count reads in repetitive regions
  • Disadvantage: You may end up showing some coverage in regions that are not actually present in the data set. Imagine a repeat exists 10 times in the genome, but one copy is in a homozygously deleted region. If you have a thousand reads that map to any other copy of that repeat, you'll show that each repeat has coverage, including the one in the deleted region.

In response to your second question, your choice of aligner and input parameters will determine how repetitive reads are marked, if at all. Some aligners just assign ambiguously mapping reads to one of the positions randomly, but note this in the quality score.

ADD COMMENT
0
Entering edit mode

I suppose the solution lh3 mentioned for RNA-Seq, distribute repetitive reads based on the depth of unique reads nearby, would solve your second problem. Or is that not true?

ADD REPLY

Login before adding your answer.

Traffic: 2777 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6