BWA/samtools output match length and SNPs number of each reads?
0
0
Entering edit mode
9.4 years ago
2012201024 ▴ 20

When using BWA/samtools, does anyone know how to output the SNP number and math length (Minimum seed length) of each reads?

For example, BOWTIE returns something like this for each result: 64:G>N,66:G>N (SNP number: 2) or if some other tools can solve.

Many thanks in advance

alignment snp • 2.4k views
ADD COMMENT
2
Entering edit mode

Doesn't BWA produce MD auxiliary tags? This information is stored in there (likewise with bowtie2 or bowtie1 when using the SAM format).

ADD REPLY
0
Entering edit mode

Thanks a lot.

I have got MD tags through:

samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>

Furthermore, I want to analysis the SNPs number (even Nucleotide substitution rate) of each reads compared with the reference.

Are there some python scripts which can solve this?

ADD REPLY
1
Entering edit mode

Thanks a lot.

I have got MD tags through:

samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>

Furthermore, I want to analysis the SNPs number (even Nucleotide substitution rate) of each reads compared with reference.

Are there some python scripts which can solve this?

ADD REPLY
0
Entering edit mode

Perhaps someone has written such a tool, but if not it wouldn't be difficult for you to do so.

ADD REPLY
0
Entering edit mode
Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch
Reference: ...CTTCTATTATCCTT...  M       =/X         MD
     Read:    CTTCTATTATCCTT     14M     14=         14       // example 1
     Read:    CTTATATTATCCTT     14M     3=1X10=     3C10     // example 2
     Read:    CTTATATTGGCCTT     14M     3=1X4=2X4=  3C4AT4
     Read:    CTTCTATTGGCCTT     14M     8=2X4=      8AT4
     Read:    TTTATATTATCCTG     14M     1X12=1X     0C12T0

If I want to calculate nucleotide substitution rate, can I count the number of "A,T,C,G" in MD, divide by M+number of nucleotide substitution? Should I consider the length of InDels? Thanks!

ADD REPLY
0
Entering edit mode

That's what I'd do. You can ignore indels if you just want the substitution rate.

ADD REPLY

Login before adding your answer.

Traffic: 2892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6