samtools depth and sambamba depth produce different results
0
0
Entering edit mode
7 months ago
Johannes • 0

I would like to calculate some very general summary statistics on bam files: average read depth and breadth of coverage at >= 1x and >10x. I am using samtools and sambamba to calculate depth and they seem to produce differing results as evidenced by the summary files below.
According to the documentation, these are the default filters:

  • samtools depth filters reads that have any of the flags UNMAP, SECONDARY, QCFAIL, or DUP (-g option)
  • sambamba depth filters reads that have 'mapping_quality > 0 and not duplicate and not failed_quality_control' (-F option)

For the analysis I am doing, I want to be stringent an exclude any multimappers, low quality alignments, duplicates or secondary alignments. Therefore, I have specified the filter options in both commands as shown below.

Here is my samtools command:

samtools depth -a -Q 1 test.bam > test.genomeCoverage

This is the head of the output:

chr1    1   5
chr1    2   5
chr1    3   5
chr1    4   5
chr1    5   5
chr1    6   5
chr1    7   5
chr1    8   5
chr1    9   5
chr1    10  5

Here is my sambamba command:

sambamba depth base -c 0 -t 8 -F "mapping_quality > 0 and not (duplicate or failed_quality_control or unmapped or secondary_alignment)" -o test.genomeCoverage test.bam

This is the head of the resulting file:

REF POS COV A   C   G   T   DEL REFSKIP SAMPLE
chr1    0   5   0   5   0   0   0   0   *
chr1    1   5   0   0   0   5   0   0   *
chr1    2   5   5   0   0   0   0   0   *
chr1    3   5   5   0   0   0   0   0   *
chr1    4   5   0   5   0   0   0   0   *
chr1    5   5   0   5   0   0   0   0   *
chr1    6   5   0   5   0   0   0   0   *
chr1    7   5   0   0   0   5   0   0   *
chr1    8   5   5   0   0   0   0   0   *

I then produce summary statistics from that 'test.genomeCoverage' file with the following code:

echo -n -e "Average read depth across genome:\t"  > ${o}_summaryStats
cat test.genomeCoverage | awk '{sum+=$3} END {print sum/NR}' >> ${o}_summaryStats
echo -n -e "% genome (breadth) covered >0x:\t" >> ${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}' >> ${o}_summaryStats
echo -n -e "% genome (breadth) covered >10x:\t" >> ${o}_summaryStats
cat test.genomeCoverage | awk '{c++; if($3>10) total+=1}END{print (total/c)*100}' >> ${o}_summaryStats

The output for the file produced with samtools is this:

Average read depth across genome:   84.1805
% genome (breadth) covered >0x: 97.7017
% genome (breadth) covered >10x:    94.7993

The output for the file produced with sambamba is this:

Average read depth across genome:   84.3173
% genome (breadth) covered >0x: 97.7089
% genome (breadth) covered >10x:    94.8096

Obviously, the difference is small but still I am confused and would like to understand where it is coming from. I would be grateful for any suggestions or hints as to how I could check where that difference is coming from.

Context: this is a small benchmarking project of Nanopore vs Illumina sequencing which is why I want to get these stats right. I would expect Nanopore to have a higher breadth because of the better coverage in repetitive regions. This is also the reason why I want a stringent filter which excludes any multimappers or reads that are in any way ambiguously aligned.

sambamba samtools depth coverage • 599 views
ADD COMMENT

Login before adding your answer.

Traffic: 2296 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6