How To Find Out The Mismatchs Of An Alignment Entry In The Sam File?
3
9
Entering edit mode
10.0 years ago
Rnaer ▴ 120

Like this entry from Bowtie mapping result, how can I find how many and where are the mismatches in the alignment?

2358_2039_1969_F3    16    chrX    464352    255    50M    *    0    0    ATATCTATATATATGAAAAGATTGCGAACAAAAAAGATGATGGAAAAGGA    )?32OUY\`\ZPLU[_]^`]VX^^_[YOQccBBZYb`_bccccBBcccbA    XA:i:2    MD:Z:4A45    NM:i:1    CM:i:5

Obviously I can compare the sequence with chrX:464352 of the reference genome, but this is such a pain and takes computation time to do that for millions of reads. Is there a better way?

Thanks.

sam samtools • 16k views
ADD COMMENT
11
Entering edit mode
10.0 years ago
Jts ★ 1.3k

I use the samtools calmd program with the -e option which will replace the matching bases in the read with '=' symbols. Once this is done, it is easy to parse out the mismatching bases.

This command also fills in the MD tag which contains the mismatching information. From the spec:

The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string

ADD COMMENT
2
Entering edit mode
10.0 years ago

Your alignment is defined in the CIGAR string. Here

50M

means that there is 50 matches on the chrX starting from position 464352.

the flag '16' means that the hit is on the reverse strand.

See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/ for more information about the CIGAR format.

ADD COMMENT
10
Entering edit mode

A CIGAR M means 'matches or mismatches' so you can't tell how many matches there are from '50M' alone

ADD REPLY
6
Entering edit mode

hm, additionally, it's probably good to look at the tags, NM:i:1 indicates there's a mismatch in base-space, as does the MD:Z:4A45 which indicates that the mismatch occurs at the 5th position. (not sure if that's in color or base-space as your reads must be in colorspace).

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks, brentp. That's helpful. Where can I find the documentation for those optional fields following the QUAL string? Neither in SamTools document nor in Bowtie manual.

ADD REPLY
0
Entering edit mode

Yes, for most of them, but what about 'XA' tag? And for 'NM', what does it really mean by "Edit distance to the reference, including ambiguous bases but excluding clipping"?

ADD REPLY
0
Entering edit mode
10.0 years ago

another caveat is that with most short read aligners we are still talking about a local alignment, so what are ostensibly mismatches (and matches) at either end of the query sequence are instead soft clipped positions (S). This may or may not be how you want to score it, so rebuilding the alignment is often a necessary step.

ADD COMMENT

Login before adding your answer.

Traffic: 1273 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6