Question: How To Find Out The Mismatchs Of An Alignment Entry In The Sam File?
8
gravatar for Rnaer
7.9 years ago by
Rnaer110
United States
Rnaer110 wrote:

Like this entry from Bowtie mapping result, how can I find how many and where are the mismatches in the alignment?

2358_2039_1969_F3    16    chrX    464352    255    50M    *    0    0    ATATCTATATATATGAAAAGATTGCGAACAAAAAAGATGATGGAAAAGGA    )?32OUY\`\ZPLU[_]^`]VX^^_[YOQccBBZYb`_bccccBBcccbA    XA:i:2    MD:Z:4A45    NM:i:1    CM:i:5

Obviously I can compare the sequence with chrX:464352 of the reference genome, but this is such a pain and takes computation time to do that for millions of reads. Is there a better way?

Thanks.

samtools sam • 13k views
ADD COMMENTlink modified 7.9 years ago by Jts1.2k • written 7.9 years ago by Rnaer110
11
gravatar for Jts
7.9 years ago by
Jts1.2k
Jts1.2k wrote:

I use the samtools calmd program with the -e option which will replace the matching bases in the read with '=' symbols. Once this is done, it is easy to parse out the mismatching bases.

This command also fills in the MD tag which contains the mismatching information. From the spec:

The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Jts1.2k
2
gravatar for Pierre Lindenbaum
7.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

Your alignment is defined in the CIGAR string. Here

50M

means that there is 50 matches on the chrX starting from position 464352.

the flag '16' means that the hit is on the reverse strand.

See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/ for more information about the CIGAR format.

ADD COMMENTlink written 7.9 years ago by Pierre Lindenbaum116k
10

A CIGAR M means 'matches or mismatches' so you can't tell how many matches there are from '50M' alone

ADD REPLYlink written 7.9 years ago by iw9oel_ad6.0k
6

hm, additionally, it's probably good to look at the tags, NM:i:1 indicates there's a mismatch in base-space, as does the MD:Z:4A45 which indicates that the mismatch occurs at the 5th position. (not sure if that's in color or base-space as your reads must be in colorspace).

ADD REPLYlink written 7.9 years ago by brentp22k
1

It's in the SAM format specification http://samtools.sourceforge.net/SAM-1.3.pdf

ADD REPLYlink written 7.9 years ago by iw9oel_ad6.0k

Thanks, brentp. That's helpful. Where can I find the documentation for those optional fields following the QUAL string? Neither in SamTools document nor in Bowtie manual.

ADD REPLYlink written 7.9 years ago by Rnaer110

Yes, for most of them, but what about 'XA' tag? And for 'NM', what does it really mean by "Edit distance to the reference, including ambiguous bases but excluding clipping"?

ADD REPLYlink written 7.9 years ago by Rnaer110
0
gravatar for Jeremy Leipzig
7.9 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

another caveat is that with most short read aligners we are still talking about a local alignment, so what are ostensibly mismatches (and matches) at either end of the query sequence are instead soft clipped positions (S). This may or may not be how you want to score it, so rebuilding the alignment is often a necessary step.

ADD COMMENTlink written 7.9 years ago by Jeremy Leipzig18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1765 users visited in the last hour