Question: Mismatch and Indel statistics from BAM/SAM file
sbdk (United States) wrote, 3.7 years ago:

I am trying to compute mismatch and indel statistics from a SAM/BAM file generated with BWA. The statistics should include the %mismatch and %indel for each aligned read. Are there any good tools I could use?

Tags: sam, bam, indel
trausch wrote, 3.7 years ago:

You can also try alfred. It needs a sorted & indexed BAM file and the reference genome you used for the alignment.

alfred qc -r <reference.fa> <align.bam>

It computes the error rates you are looking for and some other metrics (insert size, coverage, ...).
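
For a quick per-read check without any extra tool, the %mismatch and %indel asked about in the question can be approximated straight from each SAM record. This is only a minimal sketch, not what Alfred does internally. It assumes every record carries an NM tag (edit distance, which BWA normally writes), takes mismatches as NM minus the inserted and deleted bases, and uses the number of alignment columns (M/=/X + I + D) as the denominator. The function name read_error_rates is made up for this example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def read_error_rates(sam_line):
    """Return (%mismatch, %indel) for one SAM record, or None if it cannot be computed."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return None
    flag, cigar = int(fields[1]), fields[5]
    if flag & 0x4 or cigar == "*":  # unmapped read or no alignment
        return None
    # NM:i:<n> is the edit distance: mismatches + inserted bases + deleted bases.
    nm = next((int(f[5:]) for f in fields[11:] if f.startswith("NM:i:")), None)
    if nm is None:
        return None
    lengths = {}
    for n, op in CIGAR_OP.findall(cigar):
        lengths[op] = lengths.get(op, 0) + int(n)
    indel_bases = lengths.get("I", 0) + lengths.get("D", 0)
    aligned = lengths.get("M", 0) + lengths.get("=", 0) + lengths.get("X", 0)
    columns = aligned + indel_bases  # assumed denominator: alignment columns
    if columns == 0:
        return None
    mismatches = nm - indel_bases
    return 100.0 * mismatches / columns, 100.0 * indel_bases / columns

# Made-up record: 96 aligned bases, a 2 bp insertion, a 2 bp deletion, NM = 7.
example = "\t".join(["read1", "0", "chr1", "100", "60", "50M2I46M2D",
                     "*", "0", "0", "A" * 98, "*", "NM:i:7"])
print(read_error_rates(example))  # (3.0, 4.0)
```

The records can come from a plain SAM file or from the output of samtools view on a BAM file, skipping header lines that start with "@". Soft- and hard-clipped bases are ignored here; whether they should count is a choice you have to make for your own definition of the rates.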

(Answer by trausch, written 3.7 years ago, modified 2.4 years ago)

Is there an executable? I am getting an error while compiling.

(Reply by sbdk, written 3.4 years ago)

Yes, statically compiled binaries are available here. We have previously run it on Nanopore, Illumina, and PacBio reads, but if you experience any problems, please let me know.

(Reply by trausch, written 3.4 years ago, modified 3.4 years ago)

It worked fine, except for the BAM files generated by LAST.

(Reply by sbdk, written 3.4 years ago)

Looks nice. Did you announce it in the Tools section?

(Reply by h.mon, written 3.4 years ago)

Thanks. I have not created a Tools page on Biostars, but there is a fairly extensive README on GitHub.

(Reply by trausch, written 3.4 years ago)

metrics.tsv should contain all the statistics, right? I am having a hard time reading that file. Would it be possible to make it a text file laid out like the following?

#Mapped             196263
MappedFraction  0.31431
#MappedRead1    196263

(Reply by brs111110, written 3.4 years ago)

It is a tab-delimited text file. You can use datamash to convert it to row-format:

cat outprefix.metrics.tsv | datamash transpose | column -t

The column-format is useful if you want to compare statistics across multiple samples because you can just concatenate the metrics files.
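
If datamash is not installed, the same transpose is a few lines of Python. This is just a sketch that assumes the metrics file is a small tab-delimited table with one header row and one or more value rows. The function name transpose_tsv is made up, and the column names in the sample are the ones quoted in the comment above, not a full list of what the file contains.

```python
import csv
import io

def transpose_tsv(text):
    """Turn a tab-delimited table (header row + value rows) into one row per column."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return []
    width = max(len(r) for r in rows)
    # Pad short rows so that zip() does not silently drop trailing columns.
    padded = [r + [""] * (width - len(r)) for r in rows]
    return [list(col) for col in zip(*padded)]

# Sample built from the fields quoted earlier in this thread.
sample = "#Mapped\tMappedFraction\t#MappedRead1\n196263\t0.31431\t196263\n"
for name, *values in transpose_tsv(sample):
    print(name.ljust(20), "\t".join(values))
```

To run it on a real file, read the file contents into a string first, for example with open("outprefix.metrics.tsv").read(), and pass that to the function.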

(Reply by trausch, written 3.3 years ago)

Thanks for your alfred tool. I have computed the mismatch rate and error rates of my long-read alignment. Mismatch rate seems to be number of mismatches / number of aligned bases. How do you define the error rate? I have an error rate of 11.4%. What does that mean? And how do you define the insertion and deletion rate? Is one wrong insertion counted as one and then you divide the total by the number of aligned bases? If there are two consecutive wrongly inserted bases, do you count that as two errors? Thanks.

(Reply by cristian260, written 19 months ago)

The InDel size doesn't matter as discussed in this Alfred issue.
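
To make the difference between the two ways of counting concrete, here is a small sketch that counts indel events (each insertion or deletion once, whatever its length) next to indel bases from a CIGAR string. Reading "size doesn't matter" as event-based counting is my interpretation of the reply above. The function name is made up, and this is not Alfred's code.

```python
import re

def indel_events_and_bases(cigar):
    """Return (number of I/D operations, total number of inserted + deleted bases)."""
    events = bases = 0
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "ID":
            events += 1      # one event per insertion or deletion, regardless of size
            bases += int(n)  # size-aware alternative: every inserted/deleted base counts
    return events, bases

# A 2 bp insertion and a 1 bp deletion: 2 events, 3 bases.
print(indel_events_and_bases("10M2I5M1D3M"))  # (2, 3)
```

Under the event-based reading, two consecutive inserted bases reported as a single 2I operation would count as one error, not two.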

(Reply by trausch, written 19 months ago)
Brian Bushnell (Walnut Creek, USA) wrote, 3.7 years ago:

BBMap's Reformat tool can produce some of these statistics:

ehist=<file>            Errors-per-read histogram.
qahist=<file>           Quality accuracy histogram of error rates versus quality score.
indelhist=<file>        Indel length histogram.
mhist=<file>            Histogram of match, sub, del, and ins rates by read location.
ihist=<file>            Insert size histograms.  Requires paired reads interleaved in sam file.
idhist=<file>           Histogram of read count versus percent identity.

BBMap also prints out a summary of match, mismatch, insertion, and deletion rates when it runs. But I think you can get most of what you want with Reformat, particularly, the mhist output.


Thanks, Brian! So I can use my existing BAM/SAM files with BBMap? Could you please help me with the right command?

I am trying this: in=mapped.sam ehist=ehist.txt

But I get a blank file.

(Reply by sbdk, written 3.7 years ago, modified 3.7 years ago)

That command should be fine. You need to have MD tags in the SAM file, but those should be present by default with bwa. I just ran this on a bwa-produced BAM file to make sure, and it worked as expected:

in=502930_1127296.bam ehist=ehist.txt
java -ea -Xmx200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.ReformatReads in=502930_1127296.bam ehist=ehist.txt
Executing jgi.ReformatReads [in=502930_1127296.bam, ehist=ehist.txt]

Set error histogram output to ehist.txt
No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Found samtools 1.4
Input is being processed as unpaired
Input:                          18069294 reads                  1824440174 bases
Output:                         18069294 reads (100.00%)        1824440174 bases (100.00%)

Time:                           38.741 seconds.
Reads Processed:      18069k    466.41k reads/sec
Bases Processed:       1824m    47.09m bases/sec

Contents of the file:

#Errors Count
0   16190186
1   1414068
2   197871
3   65303
4   31407
5   18726
6   12160
7   8348
8   5873
9   4199
10  2797
11  1875
12  1266
13  742
14  422
15  166
16  40
17  1
18  1
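
A histogram like this is easy to boil down to a couple of summary numbers. The sketch below assumes the two-column "#Errors Count" layout shown above, and the function name summarize_ehist is made up. For the file above, roughly 90% of the counted reads have zero errors.

```python
def summarize_ehist(text):
    """Summarize an errors-per-read histogram with columns: errors, read count."""
    total = errors = error_free = 0
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        n_err, count = (int(x) for x in line.split()[:2])
        total += count
        errors += n_err * count
        if n_err == 0:
            error_free += count
    if total == 0:
        return None
    return {
        "reads": total,
        "error_free_fraction": error_free / total,
        "mean_errors_per_read": errors / total,
    }

# Tiny made-up histogram in the same layout.
print(summarize_ehist("#Errors\tCount\n0\t8\n1\t1\n2\t1\n"))
```

Dividing the mean errors per read by the mean aligned read length would give an approximate per-base error rate, but the read length is not part of this file.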

Note that you can write all the histograms at once if you want. Can you run "tail" on the sam file so I can see what a few reads look like?
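
Since the histograms depend on MD tags, a quick way to rule out the most likely cause of an empty output is to check whether the records carry one. This is a rough sketch with a made-up function name. It returns the number of substitutions encoded in the MD tag, or None when the tag is missing.

```python
import re

# MD tag tokens: a run of matches (digits), a deletion (^ACGT...), or one mismatched base.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")

def md_mismatches(sam_line):
    """Return the substitution count from the MD:Z tag, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
    if md is None:
        return None
    # Deleted reference bases (after ^) are not substitutions, so only group 3 counts.
    return sum(1 for _matches, _deletion, sub in MD_TOKEN.findall(md) if sub)

# Made-up record with two substitutions and one 2 bp deletion in the MD tag.
record = "\t".join(["read1", "0", "chr1", "100", "60", "16M2D10M", "*", "0", "0",
                    "A" * 26, "*", "NM:i:4", "MD:Z:10A5^AC6T3"])
print(md_mismatches(record))  # 2
```

If this returns None for your reads, the aligner did not write MD tags. samtools calmd can add them given the reference used for the alignment.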

(Reply by Brian Bushnell, written 3.7 years ago, modified 3.7 years ago)

(Reply by sbdk, written 3.4 years ago, modified 3.4 years ago by GenoMax)

I am still getting a blank file. Please check the last two lines of the SAM file.

(Reply by sbdk, written 3.4 years ago)

Is d6da0f06-2394-4203-9190-057434731910 an ID from Nanopore or some other sequencing tech?

(Reply by GenoMax, written 3.4 years ago)

Yes, it is from Nanopore. I am trying to align Nanopore reads against both Illumina and PacBio contigs.

(Reply by sbdk, written 3.4 years ago)

Is this the mismatch profile for every read, or the error rate that originates from the sequencing itself? Thanks!

(Reply by Dogancan30, written 2.1 years ago)
FatihSarigol wrote, 2.1 years ago:

Try Qualimap2

Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data
