Question: How Should I Compute Coverage When Some Reads Map To Multiple Locations?
5
gravatar for Ryan Thompson
8.0 years ago by
Ryan Thompson3.4k
TSRI, La Jolla, CA
Ryan Thompson3.4k wrote:

I have a bam file in which some reads are mapped uniquely and some are mapped to multiple genomic positions (because of gene duplications or pseudogenes or whatever). I want to be able to calculate the coverage for any base in the genome. I know how to calculate coverage in general, but I have two questions regarding the treatment of reads that map to multiple locations.

  1. If a read maps to N locations, should I count it at each location with a depth of 1/N? In other words, should I "split" the coverage equally across the N locations? Should I use mapping quality or some other measure to do a weighted split? Or should I just count it as a depth of 1 at all locations where it maps?

  2. When reading from a bam file, how can I easily tell how many positions a read has been mapped to? Is there a field in the bam file that gives me this information? Is the read simply listed multiple times? Can I easily preprocess the bam file to add a "map count" in a custom field?

Edit

I want to calculate coverage mostly for QC-stats. I am doing exome sequencing and I want to be able to answer questions such as "What fraction of exonic nucleotides were covered at at least 5x?"

bam sam coverage • 3.8k views
ADD COMMENTlink modified 7.9 years ago by Chris Miller20k • written 8.0 years ago by Ryan Thompson3.4k

FWIW, I think counting depth as 1/N makes the most sense. Otherwise, reads from repetitive regions get more weight then non-repetitive ones - i.e. the sum contribution from each read should be 1.

ADD REPLYlink written 8.0 years ago by Ketil3.9k

For point 1, it may be worth reading Alkan et al, 2009. http://www.nature.com/ng/journal/v41/n10/abs/ng.437.html

ADD REPLYlink written 7.9 years ago by Aaronquinlan10k
5
gravatar for lh3
8.0 years ago by
lh331k
United States
lh331k wrote:
  1. This largely depends on your downstream analyses. For RNA-seq, for example, some methods distribute repetitive reads based on the depth of unique reads nearby. Perhaps others will give you more precise answers.

  2. No, you cannot easily get this information unless the mapper in use keeps in optional SAM tags. If you really want to get this, sort your BAM by read names and then do a count.

ADD COMMENTlink written 8.0 years ago by lh331k

+1 for mentioning the distributive approach

ADD REPLYlink written 7.9 years ago by Chris Evelo9.9k

+1 also for the weighted distribution of reads. You map unique reads to your entities (gene, exon, intergenic island,...) and then assign your non-unique reads regarding the coverage of the entities your read mapped to. If a non unique read map to a location with a coverage of 9 and another with a coverage of 1 then 90% of this read goes to the 1st location and 10% to the second. That's a common approach with RNA-seq as described even though there is still some alternatives.

ADD REPLYlink written 7.9 years ago by Philippe1.9k
3
gravatar for Chris Miller
8.0 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

There are two ways of dealing with this issue:

1) Use only uniquely mapping reads.

  • Advantage: solves your issue neatly
  • Disadvantage: You lose some information. Also, if you're doing binned copy number calculations and don't have a matched normal, you'll then have to correct for mapability.

2) Count each read's contribution to the depth as 1/N.

  • Advantage: Allows you to count reads in repetitive regions
  • Disadvantage: You may end up showing some coverage in regions that are not actually present in the data set. Imagine a repeat exists 10 times in the genome, but one copy is in a homozygously deleted region. If you have a thousand reads that map to any other copy of that repeat, you'll show that each repeat has coverage, including the one in the deleted region.

In response to your second question, your choice of aligner and input parameters will determine how repetitive reads are marked, if at all. Some aligners just assign ambiguously mapping reads to one of the positions randomly, but note this in the quality score.

ADD COMMENTlink modified 8.0 years ago • written 8.0 years ago by Chris Miller20k

I suppose the solution lh3 mentioned for RNA-Seq, distribute repetitive reads based on the depth of unique reads nearby, would solve your second problem. Or is that not true?

ADD REPLYlink written 7.9 years ago by Chris Evelo9.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1033 users visited in the last hour