samtools depth and sambamba depth produce different results
0
0
Entering edit mode
7 months ago
Johannes • 0

I would like to calculate some very general summary statistics on bam files: average read depth and breadth of coverage at >= 1x and >10x. I am using samtools and sambamba to calculate depth and they seem to produce differing results as evidenced by the summary files below.
According to the documentation, these are the default filters:

• samtools depth filters reads that have any of the flags UNMAP, SECONDARY, QCFAIL, or DUP (-g option)
• sambamba depth filters reads that have 'mapping_quality > 0 and not duplicate and not failed_quality_control' (-F option)

For the analysis I am doing, I want to be stringent an exclude any multimappers, low quality alignments, duplicates or secondary alignments. Therefore, I have specified the filter options in both commands as shown below.

Here is my samtools command:

samtools depth -a -Q 1 test.bam > test.genomeCoverage


This is the head of the output:

chr1    1   5
chr1    2   5
chr1    3   5
chr1    4   5
chr1    5   5
chr1    6   5
chr1    7   5
chr1    8   5
chr1    9   5
chr1    10  5


Here is my sambamba command:

sambamba depth base -c 0 -t 8 -F "mapping_quality > 0 and not (duplicate or failed_quality_control or unmapped or secondary_alignment)" -o test.genomeCoverage test.bam


This is the head of the resulting file:

REF POS COV A   C   G   T   DEL REFSKIP SAMPLE
chr1    0   5   0   5   0   0   0   0   *
chr1    1   5   0   0   0   5   0   0   *
chr1    2   5   5   0   0   0   0   0   *
chr1    3   5   5   0   0   0   0   0   *
chr1    4   5   0   5   0   0   0   0   *
chr1    5   5   0   5   0   0   0   0   *
chr1    6   5   0   5   0   0   0   0   *
chr1    7   5   0   0   0   5   0   0   *
chr1    8   5   5   0   0   0   0   0   *


I then produce summary statistics from that 'test.genomeCoverage' file with the following code:

echo -n -e "Average read depth across genome:\t"  > ${o}_summaryStats cat test.genomeCoverage | awk '{sum+=$3} END {print sum/NR}' >> ${o}_summaryStats echo -n -e "% genome (breadth) covered >0x:\t" >>${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}' >>${o}_summaryStats
echo -n -e "% genome (breadth) covered >10x:\t" >> ${o}_summaryStats cat test.genomeCoverage | awk '{c++; if($3>10) total+=1}END{print (total/c)*100}' >> \${o}_summaryStats


The output for the file produced with samtools is this:

Average read depth across genome:   84.1805
% genome (breadth) covered >0x: 97.7017
% genome (breadth) covered >10x:    94.7993


The output for the file produced with sambamba is this:

Average read depth across genome:   84.3173
% genome (breadth) covered >0x: 97.7089
% genome (breadth) covered >10x:    94.8096


Obviously, the difference is small but still I am confused and would like to understand where it is coming from. I would be grateful for any suggestions or hints as to how I could check where that difference is coming from.

Context: this is a small benchmarking project of Nanopore vs Illumina sequencing which is why I want to get these stats right. I would expect Nanopore to have a higher breadth because of the better coverage in repetitive regions. This is also the reason why I want a stringent filter which excludes any multimappers or reads that are in any way ambiguously aligned.

sambamba samtools depth coverage • 599 views