I am trying to find some statistics of mismatches and indels from SAM/BAM file. The SAM file is generated using BWA. The statistics should include the %mismatch and %indel for each aligned reads. I am wondering if there are any good tools I could use.
You can also try alfred. It needs a sorted & indexed BAM file and the reference genome you used for the alignment.
alfred qc -r <reference.fa> <align.bam>
It computes the error rates you are looking for and some other metrics (insert size, coverage, ...).
Is there an executable? I am getting error while compiling.
Yes, there are statically compiled binaries available here and we did run it previously on Nanopore, Illumina and PacBio reads but if you experience any problems please let me know.
Looks nice, did you announce it at the Tools section?
Thanks, I have not created a Tools page in Biostars but there is a fairly extensive README on github.
metrics.tsv should contain all the statistics, right? I am having hard time to read that file. Would it be possible to make it a text file that should contain like the following
....
.....
#Mapped 196263
MappedFraction 0.31431
#MappedRead1 196263
.....
.....
Thanks for your alfred tool. I have computed the mismatch rate and error rates of my long-read alignment. Mismatch rate seems to be number of mismatches / number of aligned bases. How do you define the error rate? I have an error rate of 11.4%. What does that mean? And how do you define the insertion and deletion rate? Is one wrong insertion counted as one and then you divide the total by the number of aligned bases? If there are two consecutive wrongly inserted bases, do you count that as two errors? Thanks.
BBMap's Reformat tool can produce some of these statistics:
ehist=<file> Errors-per-read histogram.
qahist=<file> Quality accuracy histogram of error rates versus quality score.
indelhist=<file> Indel length histogram.
mhist=<file> Histogram of match, sub, del, and ins rates by read location.
ihist=<file> Insert size histograms. Requires paired reads interleaved in sam file.
idhist=<file> Histogram of read count versus percent identity.
BBMap also prints out a summary of match, mismatch, insertion, and deletion rates when it runs. But I think you can get most of what you want with Reformat, particularly, the mhist output.
Thanks Brian !! So I can use my existing BAM/SAM files with BBMAP? Could you please help me with the right command?
I am trying this
reformat.sh in=mapped.sam ehist=ehist.txt
But I get a blank file.
That command should be fine. You need to have MD tags in the sam file, but those should be present by default with bwa. I just ran this on a bwa-produced bam file to make sure, and it worked as expected:
reformat.sh in=502930_1127296.bam ehist=ehist.txt
java -ea -Xmx200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.ReformatReads in=502930_1127296.bam ehist=ehist.txt
Executing jgi.ReformatReads [in=502930_1127296.bam, ehist=ehist.txt]
Set error histogram output to ehist.txt
No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
Found samtools 1.4
Input is being processed as unpaired
Input: 18069294 reads 1824440174 bases
Output: 18069294 reads (100.00%) 1824440174 bases (100.00%)
Time: 38.741 seconds.
Reads Processed: 18069k 466.41k reads/sec
Bases Processed: 1824m 47.09m bases/sec
Contents of the file:
#Errors Count
0 16190186
1 1414068
2 197871
3 65303
4 31407
5 18726
6 12160
7 8348
8 5873
9 4199
10 2797
11 1875
12 1266
13 742
14 422
15 166
16 40
17 1
18 1
Note that you can write all the histograms at once if you want. Can you run "tail" on the sam file so I can see what a few reads look like?
0d6893dc-57eb-4eaa-a1ad-406b4528cc6f 0 c12692/f5p5/989 132 60 10S12M1D30M2I9M1I6M2D39M1I10M2I6M1D17M1D12M1D1M2D11M1D10M1I12M1I67M4I21M2D24M2I11M3D18M1I24M1D26M1D11M5D26M1I8M1D2M1D22M1D4M2D25M1D8M1D2M2D33M1D15M4I12M1D22M2I23M2D6M1D4M2D10M1I25M3I5M3I2M1I7M1I22M4I10M3I11M6I4M1I11M1I5M2I4M2I3M1I4M3I3M2I1M4I8M1D7M1I11M2I7M1I17M2I6M1I5M1I1M1I2M2I8M5I24M51S * 0 0 CATTACGGCCGGAGGGGCAAGCAAAGATTGTGCTCAAGTTGCCCCTGGATGACCCGAGAGGAAGGAGGAGAGGCCTTTCAGGCTGCTGTGGGAATGAACGGTGTGACATCCCGCCACGATGCGGAGGGGACAAGATCGTCGTCGTGGCGACGGCGTCATTATCACGCTCCCACAATAATATCAGCGCAGCCTCCGGCTTTACGGAGCTGCTCAGCGTCAGCTCCGGCGACGACAAGAAGAAGGGCATTGGGTATGGCTACTGCAGGGAGCAGCGGTGGCATGATGCAACGGCGGTGGCAAAGAGGGCAACGGGAGAGCAAAGGCAGCGGCGGTGGTGGCAGAGAGCAGTGGAGGCTACGGCGGCCGTACCAGGCGATGCCGCCGGTCTCATCCCTGCCTACCTGCAACGCCATGCCGTCGTCCTACCCACTGCGTCTCTCTATCCTGCAGCATACCCGCAGAACAAGGACCCTGGATGCAGCATCATGTGAAAAACATCTACGTTAATCTGAGGAGTTGATTGAAGGTGGCCGAAACATGGAGATCAGATGAATTAACATGGTTGATTCGTCAGGACAAGATGATCCACATGCTGCCTTGAGCTGTGCAACAGGACGTGGTGGGATGTGATCGCCCAGTTGTGAACATTATCATCCGTTATGGAATCTATCAATGTGCAAGGGTTGGCCAACAAACTTGACAGATGGATGAGGTGATTAGAACAAGGTTTACAGCTGAAGGAGTGGAGGCGTTTGTTGCGCCTTGTGAGTTCAAAGAGTTGGACTTCAAATTAATAGCGGCAAACGGTGTTTCTATTTTTCGTTTTCTTTCGGTTTTGGTGGAAACAAGGTTGTGAGGTCAGTTTGTTGGAAGAAAGAAAAGAAAAGAAAGAAAAAGAAAAATATCTCGAAAAGAAAAGAAAAGAAGATGGAGCGACAGACCGTGGGTT * NM:i:170 MD:Z:3T8^A45^GA7C0A28G17^C17^C12^G1^CC0C0C9^A1A2C2G0C1C1G14A0C0G44G0A0C18G14^TA1G0G32^AGA2G39^A0T23G1^A11^AGCAG1A25G0C5^A2^C22^C2G1^AC25^A8^C2^AC33^A27^G1G43^AG4T1^A2A1^CC0A11T0T8G14T4T4A12G14T5G5G0A40G1C3^G4A3G1A7C15T37G1T0T1C2T4T2 AS:i:524 XS:i:27
d6da0f06-2394-4203-9190-057434731910 0 c74405/f20p10/1109 181 2 12S11M3D3M1I3M1D23M1D7M2I17M1D35M2I18M1I34M1D18M2D24M1D12M1D12M1D6M2D3M5D1M2D5M1D12M1I2M1I25M1I17M2I10M1I3M3I3M1D17M1D23M1I31M2D21M1I25M5I5M2I28M2I5M1I9M2D9M3D19M1I29M1D19M1D43M1D6M1I6M1I11M1I39M3I2M1I6M1I16M1D6M1D9M1I10M1D20M1D8M1D5M4D2M1D3M1I7M1I10M1I12M1I13M2I10M2I10M1I8M3I2M2I8M1D2M1I4M1I12M1I4M1D9M1I14M1I10M2I5M54S * 0 0 CATTACGGCCGGGGGACACGCAACAGTCAGGGAGGAGGACGGGCGGCGGCGGCTGGAATCCAGCTCAAGCGTCTGGGTCCTTCTCACCCGCTGTACCGTCCCTCAGGCAGCATGCATGTGCCTCACCCTATGCTCCCACGCTGGCCTTCTCTGCGCTTGGTGCATACCTCACATCCTCCTCAACGTCAGGCACCCTCACGACCATGGGATGCTGGCCTCCATCCTTCCCTCATCTCCTGCCCTTGCAGGAAGGAACCACTTCAGCAGCTGCTCATATCTGCCTTGCTCCTTCCAAGGCGCGTCATTTGGCCCCCGCTCGTCTGACGCACATTTATTGACTTCGATCAGGGATTCTTGTCACGGCGTTTGTCAGGAACAGCGGTTGCTTTTGGATGCTTCTCTGGCTGCCATCATCGCCAAGCGCAAGAAATTACCTGTACCTCGGCAGTCTCTTTTGCTTCATCTTCTGGCCTCTCCATTCTTCTCTGGCTGTGCAGCTTTTCCGCTGATCTTGTGCACCAACACGACCTTCATGTTTTGAGCTCTACCTGGGCCTCCTGGACTTTTTGGATATATAGTGTTTGTACCCAGGAGATCATCAGGAGGGCGCACCCAGGGGACCTGGACACATCAGAGCGCGCCTTTGACTCTCATCCTGGGGCTTCGTTGCGGTTCTTGTTCAGATCCTTGTCACAATCCATGATGGAAGAGTGCACAGGAGAATCAGAGACAGGAAGGAAGAGGAAGGGCGGTCTTCACTGGATGGATGGATGGCTGGGATCTGAGAACTATGTTTATTGGGGTCCACTACTACTACTTTCTACTACTAGTGCACCATGTACTCACCAATCCTATAATGCTGTATGCCTACACTAAAGAGACAAGAGAGCTTTGAGCGTAATATTCGATGACTATTTGGGGATAGATATTTTAGTTCTGCACTGTTAAAATCATCAAAATCGTCAAAAAGAGAAGGAATGATGGGAGGCGACAGATAGCA * NM:i:192 MD:Z:11^TCG4G1^A23^T24^A1A13G1A5A1C0T0C0A1A7A16G31^C18^GG18A5^C12^G1C1T8^C6^CG3^CACGG1^AC5^G7G15G6G0C18C0G18T1^G15C1^A28G2A22^GC23G0G1G15G10A33G1T0A2^TC0A5T0G1^CCA7G36T0T2^C0C1G8G7^A0C15G0A11G0G6A5^T9A3G0C9T1A0C0C1A20G0A22A10^A4C1^A4G0A13^A4T1A0G0C0T0G3A3A1^T6A1^A0C1A2^CAAC2^T12G20G23T0G6G1A15^C16A1G1G1^A0C8G5A2G0A6G11 AS:i:527 XS:i:522 XA:Z:c18552/f2p10/1085,+123,12S11M3D3M1I3M1D23M1D7M2I17M1D35M2I18M1I34M1D18M2D24M1D12M1D12M1D6M2D3M5D1M2D5M1D12M1I2M1I25M1I17M2I10M1I3M3I3M1D17M1D23M1I31M2D21M1I25M5I5M2I28M2I8M1D16M3D19M1I29M1D19M1D43M1D6M1I6M1I11M1I39M3I2M1I6M1I16M1D6M1D9M1I10M1D20M1D8M1D5M4D2M1D3M1I7M1I10M1I12M3I1M1I9M2I10M2I10M1I8M3I2M2I8M1D2M1I4M1I12M1I4M1D9M1I14M1I10M2I5M54S,194;c76099/f2p9/973,+63,12S11M3D3M1I3M1D23M1D7M2I17M1D35M2I18M1I34M1D18M2D24M1D12M1D12M1D6M2D3M5D1M2D5M1D12M1I2M1I25M1I17M2I10M1I3M3I3M1D17M1D23M1I31M2D21M1I25M5I5M2I28M2I8M1D16M3D19M1I29M1D19M1D43M1D6M1I6M1I11M1I39M3I2M1I6M1I16M1D6M1D9M1I10M1D20M1D8M1D5M4D2M1D3M1I7M1I10M1I12M3I1M1I9M2I10M2I10M1I8M3I2M2I8M1D2M1I4M1I12M1I4M1D9M1I17M69S,190;