How To Find Out The Mismatchs Of An Alignment Entry In The Sam File?
3
9
Entering edit mode
10.0 years ago
Rnaer ▴ 120

Like this entry from Bowtie mapping result, how can I find how many and where are the mismatches in the alignment?

2358_2039_1969_F3    16    chrX    464352    255    50M    *    0    0    ATATCTATATATATGAAAAGATTGCGAACAAAAAAGATGATGGAAAAGGA    )?32OUY\\ZPLU[_]^]VX^^_[YOQccBBZYb_bccccBBcccbA    XA:i:2    MD:Z:4A45    NM:i:1    CM:i:5


Obviously I can compare the sequence with chrX:464352 of the reference genome, but this is such a pain and takes computation time to do that for millions of reads. Is there a better way?

Thanks.

sam samtools • 16k views
11
Entering edit mode
10.0 years ago
Jts ★ 1.3k

I use the samtools calmd program with the -e option which will replace the matching bases in the read with '=' symbols. Once this is done, it is easy to parse out the mismatching bases.

This command also fills in the MD tag which contains the mismatching information. From the spec:

The MD ﬁeld aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is diﬀerent from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD ﬁeld ought to match the CIGAR string

2
Entering edit mode
10.0 years ago

Your alignment is defined in the CIGAR string. Here

50M
`

means that there is 50 matches on the chrX starting from position 464352.

the flag '16' means that the hit is on the reverse strand.

10
Entering edit mode

A CIGAR M means 'matches or mismatches' so you can't tell how many matches there are from '50M' alone

6
Entering edit mode

hm, additionally, it's probably good to look at the tags, NM:i:1 indicates there's a mismatch in base-space, as does the MD:Z:4A45 which indicates that the mismatch occurs at the 5th position. (not sure if that's in color or base-space as your reads must be in colorspace).

1
Entering edit mode
0
Entering edit mode

Thanks, brentp. That's helpful. Where can I find the documentation for those optional fields following the QUAL string? Neither in SamTools document nor in Bowtie manual.

0
Entering edit mode

Yes, for most of them, but what about 'XA' tag? And for 'NM', what does it really mean by "Edit distance to the reference, including ambiguous bases but excluding clipping"?

0
Entering edit mode
10.0 years ago

another caveat is that with most short read aligners we are still talking about a local alignment, so what are ostensibly mismatches (and matches) at either end of the query sequence are instead soft clipped positions (S). This may or may not be how you want to score it, so rebuilding the alignment is often a necessary step.