"Uniquely mappable read" is never satisfactorily defined. If we cannot define it, how can we measure uniqueness? In practice, all you need to do is to look at the mapping quality. DON'T look at any tags.
Btw, the following is something I wrote three years ago, which may help you to understand the problem with "unique mapping".
Eland is probably the first short read aligner. It reports a tag, which can be '[UR][0-2]|NM', for each read to indicate whether the read is mapped uniquely, mapped repetitively, or unmapped. Since then, we have been used to talking about 'mapping uniqueness' and have extended the concept to generic alignments without asking ourselves what the exact definition of uniqueness is and whether such a concept is useful in practice at all.
For Eland, mapping uniqueness is clearly defined. A read is said to be aligned uniquely if the best two alignments have different numbers of mismatches. Eland requires the full-length read to be aligned and it does not do gapped alignment. Such a definition is useful to downstream analyses.
However, when we consider base quality, the usefulness of uniqueness becomes less obvious. Suppose a read has no perfect match and two 1-mismatch hits. The first hit has a Q5 mismatch and the second has a Q30 mismatch. If the quality is accurate, the first hit is clearly better than the second. Why couldn't we trust the first hit?
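To put rough numbers on the Q5 versus Q30 example, here is a minimal Python sketch. It assumes a deliberately simplified model: each hit is weighted only by the product of the error probabilities of its mismatched bases, all hits have equal prior, and matching bases and hits the aligner failed to find are ignored. The function names are made up for illustration; this is not how any particular aligner computes it.

```python
def phred_to_prob(q):
    """Convert a Phred-scaled base quality to an error probability."""
    return 10 ** (-q / 10)


def hit_posteriors(mismatch_quals_per_hit):
    """Return a posterior probability for each candidate hit.

    Assumption (simplified model): the likelihood of a hit is the product
    of the error probabilities of its mismatched bases, and every hit has
    the same prior.
    """
    likelihoods = []
    for quals in mismatch_quals_per_hit:
        lik = 1.0
        for q in quals:
            lik *= phred_to_prob(q)
        likelihoods.append(lik)
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


# Hit 1 has a single Q5 mismatch, hit 2 has a single Q30 mismatch.
posteriors = hit_posteriors([[5], [30]])
print(posteriors)
```

Under these assumptions the hit with the Q5 mismatch takes almost all of the probability mass, even though a mismatch count alone would call the read a repeat.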
In addition, as is pointed out by one of the anonymous reviewers of my BWA paper, an aligner may not be able to find the best hits if heuristics are in use, and in this case, the aligner is only able to find 'unique' reads by its own definition.
Furthermore, once we allow gaps, mapping uniqueness becomes even less useful. Firstly, we need to redefine uniqueness once we have gaps. One possible way is to define a read as being uniquely mapped if the best two alignments have different numbers of differences (mismatches plus gaps). The definition is clear, but not useful. We know that indel errors occur rarely in Illumina reads. A hit containing one mismatch is definitely preferred over a hit with one gap.
Things get even worse when we clip reads, as we do for capillary reads. We can only define a read as being uniquely mapped when the top two alignments have different alignment scores. However, this is almost useless in practice. For long reads, we frequently get alignments with similar scores, but we seldom get two with identical scores, so nearly every read would count as unique.
Uniqueness was initially introduced to measure the reliability of ungapped short read alignment with a read aligned in full length. It is not a proper concept for generic alignments. For generic alignments, what is much more useful is mapping quality, first introduced in my MAQ paper. Mapping quality is the Phred-scaled probability of the alignment being wrong. It unambiguously measures the alignment reliability in a universal way. Calculating mapping quality relies on a proper statistical alignment/error model, and this is the right thing to do. I would strongly recommend that all aligners report mapping quality. Mapping uniqueness was not widely used two years ago and will not be widely used two years from now. It is just a temporary concept, reflecting our lack of knowledge on measuring the reliability of an alignment.
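As a quick illustration of what "Phred-scaled" means for mapping quality, the conversion can be sketched in Python as below. The function names and the cap of 60 are assumptions made for this example; real aligners choose their own caps and estimate the error probability in their own ways.

```python
import math


def mapq_to_error_prob(mapq):
    """Probability that the alignment is wrong, given a mapping quality."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong, cap=60):
    """Phred-scale an error probability: Q = -10 * log10(p_wrong).

    The cap is a hypothetical choice for this sketch.
    """
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


for q in (0, 10, 20, 30):
    print(q, mapq_to_error_prob(q))

# Two equally good hits: either one is wrong with probability 0.5.
print(error_prob_to_mapq(0.5))
```

Note that pure Phred scaling of a 50% error gives a value near 3, not 0; an aligner that reports 0 for such reads is applying its own convention on top of the scale.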
What are you using to align the reads? Have you checked the SAM format specification (http://samtools.sourceforge.net/SAM1.pdf)? Using the -F option of samtools view, you can filter out unmapped reads and PCR duplicates, provided your aligner follows the SAM format specification for its flags.
Thanks, yes, to filter out PCR duplicates and unmapped reads I can use -F 1796, but I am more interested in removing the reads that are non-uniquely mappable.
I believe that reads with exact matches at other locations in the genome get a mapping quality of 0 in BWA. You can use the -q flag to set a minimum mapping quality threshold and exclude reads that map to multiple places in the genome.
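For anyone who wants to see what the two filters do per record, here is a rough pure-Python sketch of the same logic that samtools view applies with -F (drop a record if any of the given FLAG bits is set) and -q (drop a record if MAPQ is below the threshold). The helper name keep_record and the four SAM lines are made up for illustration; FLAG is column 2 and MAPQ is column 5 as in the SAM specification.

```python
# FLAG bits from the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400


def keep_record(sam_line, exclude_mask, min_mapq):
    """Return True if the record passes both the FLAG and the MAPQ filter."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: FLAG
    mapq = int(fields[4])  # column 5: MAPQ
    return (flag & exclude_mask) == 0 and mapq >= min_mapq


# Toy records, invented for this example.
records = [
    "r1\t0\tchr1\t100\t37\t36M\t*\t0\t0\t*\t*",     # confidently mapped
    "r2\t0\tchr1\t500\t0\t36M\t*\t0\t0\t*\t*",      # MAPQ 0
    "r3\t1024\tchr2\t200\t60\t36M\t*\t0\t0\t*\t*",  # flagged as duplicate
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",             # unmapped
]

mask = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE  # equals 1796
kept = [r.split("\t")[0] for r in records if keep_record(r, mask, min_mapq=1)]
print(mask, kept)
```

With a minimum mapping quality of 1, the MAPQ 0 read is dropped along with the duplicate and the unmapped read, which is the effect being discussed here.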
"Non-unique read is placed randomly with a mapping quality 0; all hits can be outputted in a concise format." http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf
I'm unclear on how what I said is wrong. Any read that maps exactly to multiple places in the genome is a non-unique read. It has a mapping quality of 0. Where is my mistake?
You are right. Sorry, I had it backwards for some reason today.
Not a problem. If I had been wrong, I would have been fundamentally misunderstanding BWA's output for a very long time!