Question

Extract Only Uniquely Mappable Reads From Bam Files

10

Entering edit mode

11.9 years ago

dfernan ▴ 770

Hi,

I have a bam file that I want to filter by the following conditions:

Filter out all PCR duplicates - for this one I can use -F 1797 flag
Filter out all reads that can map to more than one region in the genome, i.e., I want to keep only uniquely mappable reads.

Is there a one command line trick using samtools or some other software?

Help will be appreciated, thanks!

Note. To give some background on my bam files, the fasq files were aligned using MAQ, and three lines of the bam file are outputted below.

D13G1ACXX120824:7:2212:16446:6529    16    chrM    7    71    36M    *    0    0    GTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTT    >>@BCBA@@A<?E8ABDD@@BDD?DAD@=?;?A=7<    PG:Z:D13G1ACXX.7    RG:Z:D13G1.7    NM:i:0    OQ:Z:EIHHIIIIIHCFH@IIIIIIGGHFHHHHDDDDD@<@    UQ:i:0
D13G1ACXX120824:8:2306:12417:9715    1040    chrM    7    70    36M    *    0    0    GTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTT    ><ABCB;@>>??DA;BAD@ACBD?AAC>@?>AB>7<    PG:Z:D13G1ACXX.8    RG:Z:D13G1.8    NM:i:0    OQ:Z:GEJIIHCIHADHFIFGEHGIGIHFDHGFFDFFF@@@    UQ:i:0
D13G1ACXX120824:7:2211:20706:14605    1024    chrM    13    73    36M    *    0    0    TAATAACAAAGCAAAGCACTGAAAATGCTTAGATGG    >6A=@>BABCDCBBCDDBCBCCCCC@DEB@AD@>AA    PG:Z:D13G1ACXX.7    RG:Z:D13G1.7    NM:i:0    OQ:Z:BBCFFDFEHHHDHIGJIIJIIJJJJIIJJJJJGIIJ    UQ:i:0
D13G1ACXX120824:7:2215:15973:86694    1024    chrM    13    68    36M    *    0    0    TNATAACAAAGCAAAGCACTGAAAATGCTTAGATGG    <#0:<>BB?CBDCBC?9BCBCBCCB?AEB@@C@=>=    PG:Z:D13G1ACXX.7    RG:Z:D13G1.7    NM:i:1    OQ:Z:@#1A?DFFFGGHGIIB>GIIIIIIHGEIIIIIGGFF    UQ:i:2

bam samtools • 42k views

ADD COMMENT • link updated 11.2 years ago by yifangt • 0 • written 11.9 years ago by dfernan ▴ 770

0

Entering edit mode

What are you using to align the reads? Have you checked the samtools view documentation (http://samtools.sourceforge.net/SAM1.pdf)? Using the -F flag, you can filter out unmapped reads and PCR duplicates, provided your aligner uses the SAM format specifications for their flags.

ADD REPLY • link 11.9 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

Thanks, yes, to filter out PCR and unmapped I can use the -F 1797 flag, but I am more interested in removing the reads that are non-uniquely mappable.

ADD REPLY • link 11.9 years ago by dfernan ▴ 770

3

Entering edit mode

I believe, reads that have exact matches to other locations in the genome in BWA will have mapping quality scores of 0. You can use the -q flag to set a minimum mapping quality threshold to exclude reads that map to multiple places in the genome.

ADD REPLY • link 11.9 years ago by Mitch Bekritsky ★ 1.3k

2

Entering edit mode

"Non-unique read is placed randomly with a mapping quality 0; all hits can be outputted in a concise format." http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf

ADD REPLY • link 11.9 years ago by Zev.Kronenberg 12k

3

Entering edit mode

I'm unclear on how what I said is wrong. Any read that maps exactly to multiple places in the genome is a non-unique read. It has a mapping quality of 0. Where is my mistake?

ADD REPLY • link 11.9 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

You are right. Sorry, I was thinking inverse for some reason today.

ADD REPLY • link 11.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Not a problem. If I had been wrong, I've been fundamentally misunderstanding BWA's output for a very long time!

ADD REPLY • link 11.9 years ago by Mitch Bekritsky ★ 1.3k

score 26 · Answer 1 · 2012-12-17

"Uniquely mappable read" is never satisfactorily defined. If we cannot define it, how can we measure uniqueness? In practice, all you need to do is to look at the mapping quality. DON'T look at any tags.

Btw, the following is something I wrote three years ago, which may help you to understand the problem with "unique mapping".

Eland is probably the first short read aligner. It reports a tag, which can be '[UR][0-2]|NM', for each read to indicate if the read is mapped uniquely, repetitively or unmapped. Since then, we are used to talking about `mapping uniqueness' and extend the concept to generic alignments without asking ourselves what is the exact definiation of uniqueness and if such a concept is useful in practice at all.

For Eland, mapping uniqness is clearly defined. A read is said to be aligned uniquely if the best two alignments have identical number of mismatches. Eland requires the full-length read to be aligned and it does not do gapped alignment. Such a definition is useful to downstream analyses.

However, when we consider base quality, the usefulness of uniqueness becomes less obvious. Suppose a read has no perfect match and two 1-mismatch hits. The first hit has a Q5 mismatch and the second has a Q30 mismatch. If the quality is accurate, the first hit is clearly better than the second. Why couldn't we trust the first hit?

In addition, as is pointed out by one of the anonymous reviewers of my BWA paper, an aligner may not be able to find the best hits if heuristics are in use, and in this case, the aligner is only able to find 'unique' reads by its own definition.

Furthermore, once we allow gaps, mapping uniqueness becomes even less useful. Firstly, we need to redefine uniqueness as we have gaps. One possible way is to define a read as being uniquely mapped if the best two alignments have identical number of differences (mismatches plus gaps). The definition is clear, but not useful. We know on Illumina reads indel errors occur rarely. A hit containing one mismatch is definitely preferred over a hit with one gap.

Things get even worse when we clip reads as what we do for capillary reads. We can only define a read being uniquely mapped when the top two alignments have identical alignment score. However, this is almost practically useless at all. For long reads, frequently we get alignments with similar scores, but we seldom get two with identical scores.

Uniqueness was initially introduced to measure the reliability of ungapped short read alignment with a read aligned in full length. It is not a proper concept for generic alignments. For generic alignments, what is much more useful is mapping quality, first introduced in my maq paper. Mapping quality is phred-scaled probability of the alignment being wrong. It unambiguously measures the alignment reliability in a universal way. Calculating mapping quality is related to a proper statistical alignment/error model, and this is the right thing to do. I would strongly recommend all aligners to report mapping quality. Mapping uniqueness was not widely used two years ago and will not be widely used two years later. It is just a temporary concept, reflecting our lack of knowledge on measuring the reliability of an alignment.

score 4 · Answer 2 · 2012-12-17

From the MAQ documentation:

The calculation of mapping qualities is simple, but this simple calculation considers all the factors below:

...

The repeat structure of the reference. Reads falling in repetitive regions usually get very low mapping quality.

Also,

When you see a read alignment can get a mapping quality 30, it usually implies:

The overall base quality of the read is good.

The best alignment has few mismatches.

The read has few or just one `good' hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.

By setting a -q threshold to 30 (since you have single-end reads), the reads should be mainly uniquely mappable. If you'd like to to be even more stringent, I'm afraid you may have to re-align your reads, if the output you've gotten is like what you have above... :(

score 2 · Answer 3 · 2012-12-17

2

Entering edit mode

11.9 years ago

Zev.Kronenberg 12k

look for H0, H1, and H2 tags in the optional field. Although it really depends on what aligner you used. They don't report the same optional fields.

If you can redo the alignment you can use bowtie2 and set -k 2 and filter out the reads that have a secondary hit.

ADD COMMENT • link 11.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks. The sequencing center gave me the aligned reads in bam format, they used BWA for aligning. I just edited the Q to show the bam file and the aligner I used.

ADD REPLY • link 11.9 years ago by dfernan ▴ 770

0

Entering edit mode

see if you have the NH field? NH: i Number of reported alignments that contains the query in the current record

ADD REPLY • link 11.9 years ago by Zev.Kronenberg 12k

0

Entering edit mode

@Zev.Kronenberg thanks a lot. mmm seems they do not have the NH file either. I have ~ 200 files like this which makes it annoying to have to remap them all again but so far seems the only sensitive solution given that they do not have the NH field.

ADD REPLY • link 11.9 years ago by dfernan ▴ 770

0

Entering edit mode

What fields do you have in your bams? @Mitch.Bekritsky is right set a mapping quality filter. It should be the simplest solution.

ADD REPLY • link 11.9 years ago by Zev.Kronenberg 12k

2

Entering edit mode

Looking at mapping quality is not only the simplest but also the best solution in most cases.

ADD REPLY • link 11.9 years ago by lh3 33k

score 0 · Answer 4 · 2013-08-29

0

Entering edit mode

11.2 years ago

yifangt • 0

I am still not quite clear about the option to get uniquely mapped reads from BAM/SAM file? Can any body clarify this and post the command line(s) that should be used? My second question is related: What is the option to get the perfectly mapped reads, i.e. mapped reads WITHOUT mismatch, from BAM/SAM file? Sorry for my confusion. I have tried grep "\sM[0-9]+\s" to filter out any mapped reads with insertion or deletion, but M can mean mismatch too, so that the result still contains mapped reads with mismatch. Thanks a lot!

ADD COMMENT • link 11.2 years ago by yifangt • 0

1

Entering edit mode

Maybe you should spin this off into a new question? People may want to help, but posting as an answer to someone else's question will make your question harder to find. Also, what mapping algorithm are you using?

ADD REPLY • link 11.2 years ago by Mitch Bekritsky ★ 1.3k