Extract Only Uniquely Mappable Reads From Bam Files
4
10
Entering edit mode
10.1 years ago
dfernan ▴ 720

Hi,

I have a bam file that I want to filter by the following conditions:

• Filter out all PCR duplicates - for this one I can use -F 1797 flag
• Filter out all reads that can map to more than one region in the genome, i.e., I want to keep only uniquely mappable reads.

Is there a one command line trick using samtools or some other software?

Help will be appreciated, thanks!

Note. To give some background on my bam files, the fasq files were aligned using MAQ, and three lines of the bam file are outputted below.

D13G1ACXX120824:7:2212:16446:6529    16    chrM    7    71    36M    *    0    0    GTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTT    >>@BCBA@@A<?E8ABDD@@BDD?DAD@=?;?A=7<    PG:Z:D13G1ACXX.7    RG:Z:D13G1.7    NM:i:0    OQ:Z:EIHHIIIIIHCFH@IIIIIIGGHFHHHHDDDDD@<@    UQ:i:0
D13G1ACXX120824:8:2306:12417:9715    1040    chrM    7    70    36M    *    0    0    GTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTT    ><ABCB;@>>??DA;BAD@ACBD?AAC>@?>AB>7<    PG:Z:D13G1ACXX.8    RG:Z:D13G1.8    NM:i:0    OQ:Z:GEJIIHCIHADHFIFGEHGIGIHFDHGFFDFFF@@@    UQ:i:0
D13G1ACXX120824:7:2211:20706:14605    1024    chrM    13    73    36M    *    0    0    TAATAACAAAGCAAAGCACTGAAAATGCTTAGATGG    >6A=@>BABCDCBBCDDBCBCCCCC@DEB@AD@>AA    PG:Z:D13G1ACXX.7    RG:Z:D13G1.7    NM:i:0    OQ:Z:BBCFFDFEHHHDHIGJIIJIIJJJJIIJJJJJGIIJ    UQ:i:0
D13G1ACXX120824:7:2215:15973:86694    1024    chrM    13    68    36M    *    0    0    TNATAACAAAGCAAAGCACTGAAAATGCTTAGATGG    <#0:<>BB?CBDCBC?9BCBCBCCB?AEB@@C@=>=    PG:Z:D13G1ACXX.7    RG:Z:D13G1.7    NM:i:1    OQ:Z:@#1A?DFFFGGHGIIB>GIIIIIIHGEIIIIIGGFF    UQ:i:2

bam samtools • 36k views
0
Entering edit mode

What are you using to align the reads? Have you checked the samtools view documentation (http://samtools.sourceforge.net/SAM1.pdf)? Using the -F flag, you can filter out unmapped reads and PCR duplicates, provided your aligner uses the SAM format specifications for their flags.

0
Entering edit mode

Thanks, yes, to filter out PCR and unmapped I can use the -F 1797 flag, but I am more interested in removing the reads that are non-uniquely mappable.

3
Entering edit mode

I believe, reads that have exact matches to other locations in the genome in BWA will have mapping quality scores of 0. You can use the -q flag to set a minimum mapping quality threshold to exclude reads that map to multiple places in the genome.

2
Entering edit mode

"Non-unique read is placed randomly with a mapping quality 0; all hits can be outputted in a concise format." http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf

3
Entering edit mode

I'm unclear on how what I said is wrong. Any read that maps exactly to multiple places in the genome is a non-unique read. It has a mapping quality of 0. Where is my mistake?

0
Entering edit mode

You are right. Sorry, I was thinking inverse for some reason today.

0
Entering edit mode

Not a problem. If I had been wrong, I've been fundamentally misunderstanding BWA's output for a very long time!

26
Entering edit mode
10.1 years ago
lh3 33k

"Uniquely mappable read" is never satisfactorily defined. If we cannot define it, how can we measure uniqueness? In practice, all you need to do is to look at the mapping quality. DON'T look at any tags.

Btw, the following is something I wrote three years ago, which may help you to understand the problem with "unique mapping".

Eland is probably the first short read aligner. It reports a tag, which can be '[UR][0-2]|NM', for each read to indicate if the read is mapped uniquely, repetitively or unmapped. Since then, we are used to talking about mapping uniqueness' and extend the concept to generic alignments without asking ourselves what is the exact definiation of uniqueness and if such a concept is useful in practice at all.

For Eland, mapping uniqness is clearly defined. A read is said to be aligned uniquely if the best two alignments have identical number of mismatches. Eland requires the full-length read to be aligned and it does not do gapped alignment. Such a definition is useful to downstream analyses.

However, when we consider base quality, the usefulness of uniqueness becomes less obvious. Suppose a read has no perfect match and two 1-mismatch hits. The first hit has a Q5 mismatch and the second has a Q30 mismatch. If the quality is accurate, the first hit is clearly better than the second. Why couldn't we trust the first hit?

In addition, as is pointed out by one of the anonymous reviewers of my BWA paper, an aligner may not be able to find the best hits if heuristics are in use, and in this case, the aligner is only able to find 'unique' reads by its own definition.

Furthermore, once we allow gaps, mapping uniqueness becomes even less useful. Firstly, we need to redefine uniqueness as we have gaps. One possible way is to define a read as being uniquely mapped if the best two alignments have identical number of differences (mismatches plus gaps). The definition is clear, but not useful. We know on Illumina reads indel errors occur rarely. A hit containing one mismatch is definitely preferred over a hit with one gap.

Things get even worse when we clip reads as what we do for capillary reads. We can only define a read being uniquely mapped when the top two alignments have identical alignment score. However, this is almost practically useless at all. For long reads, frequently we get alignments with similar scores, but we seldom get two with identical scores.

Uniqueness was initially introduced to measure the reliability of ungapped short read alignment with a read aligned in full length. It is not a proper concept for generic alignments. For generic alignments, what is much more useful is mapping quality, first introduced in my maq paper. Mapping quality is phred-scaled probability of the alignment being wrong. It unambiguously measures the alignment reliability in a universal way. Calculating mapping quality is related to a proper statistical alignment/error model, and this is the right thing to do. I would strongly recommend all aligners to report mapping quality. Mapping uniqueness was not widely used two years ago and will not be widely used two years later. It is just a temporary concept, reflecting our lack of knowledge on measuring the reliability of an alignment.

0
Entering edit mode

@lh3 what a detailed and well-explained answer! thanks a lot. In our case we are trying to analyze signal in repetetive regions so we only want reads that we are very confident about. From your experience what would be a good/sensitive mapq threshold? Mitch Bekritsky suggested a threshold of 30, does that sound like a good one? Which one would you suggest in the case of BWA (not MAQ) output? We use both aligners in our analysis group, and mostly hiseq 2000. Thanks again.

1
Entering edit mode

Just to clarify, I did not say that a mapping quality score of 30 is the right threshold to use. A mapping quality threshold of 30 is what the MAQ documentation suggests would be a good threshold to use. Depending on what you're sequencing project entails, you may want to set that threshold higher or lower.

0
Entering edit mode

@Mitch, thanks for the clarification. Seems like a reasonable threshold for my case though.

0
Entering edit mode

Glad to hear it! I just wanted to be clear that, like Heng, I don't mean to imply that there is a hard and fast threshold for mapping quality no matter the mapping software used and the application being developed.

0
Entering edit mode

Dear BWA Author , About my question, I have searched so long time on the internet ,but find nothing useful. somebody says : XT ,XA , samtools view -q 1, samtools view -q 30, samtools view -q 40 and so on. SO which should I to confirm and use ?? I just wanna keep uniquely mapped reads for NIPT.

thank you !

4
Entering edit mode
10.1 years ago
Mitch Bekritsky ★ 1.3k

From the MAQ documentation:

The calculation of mapping qualities is simple, but this simple calculation considers all the factors below:

...

• The repeat structure of the reference. Reads falling in repetitive regions usually get very low mapping quality.

Also,

When you see a read alignment can get a mapping quality 30, it usually implies:

• The overall base quality of the read is good.

• The best alignment has few mismatches.

• The read has few or just one good' hit on the reference, which means the current alignment is still the best even if one or two bases are actually mutations or sequencing errors.

By setting a -q threshold to 30 (since you have single-end reads), the reads should be mainly uniquely mappable. If you'd like to to be even more stringent, I'm afraid you may have to re-align your reads, if the output you've gotten is like what you have above... :(

2
Entering edit mode
10.1 years ago

look for H0, H1, and H2 tags in the optional field. Although it really depends on what aligner you used. They don't report the same optional fields.

If you can redo the alignment you can use bowtie2 and set -k 2 and filter out the reads that have a secondary hit.

0
Entering edit mode

Thanks. The sequencing center gave me the aligned reads in bam format, they used BWA for aligning. I just edited the Q to show the bam file and the aligner I used.

0
Entering edit mode

see if you have the NH field? NH: i Number of reported alignments that contains the query in the current record

0
Entering edit mode

@Zev.Kronenberg thanks a lot. mmm seems they do not have the NH file either. I have ~ 200 files like this which makes it annoying to have to remap them all again but so far seems the only sensitive solution given that they do not have the NH field.

0
Entering edit mode

What fields do you have in your bams? @Mitch.Bekritsky is right set a mapping quality filter. It should be the simplest solution.

2
Entering edit mode

Looking at mapping quality is not only the simplest but also the best solution in most cases.

0
Entering edit mode
9.4 years ago
yifangt • 0

I am still not quite clear about the option to get uniquely mapped reads from BAM/SAM file? Can any body clarify this and post the command line(s) that should be used? My second question is related: What is the option to get the perfectly mapped reads, i.e. mapped reads WITHOUT mismatch, from BAM/SAM file? Sorry for my confusion. I have tried grep "\sM[0-9]+\s" to filter out any mapped reads with insertion or deletion, but M can mean mismatch too, so that the result still contains mapped reads with mismatch. Thanks a lot!

1
Entering edit mode

Maybe you should spin this off into a new question? People may want to help, but posting as an answer to someone else's question will make your question harder to find. Also, what mapping algorithm are you using?