Question

Bamtools stats and samtools flagstat produce different results and confusing my sambamba output.

0

Entering edit mode

2.9 years ago

DNAngel ▴ 250

A bit of a confusing one here. First I mapped my fastq reads using BWA, converted my sam files to bam, sorted and indexed them. Next, I obtained some stats and I wanted to use two different programs so I ran the following:

bamtools stats -in sample_sorted.bam -insert > sample_sorted_bamtools.stats.txt
samtools flagstat -t 16 "sample_sorted.bam" > "sample_sorted_samtools.stats.txt"  # quotes needed for this, don't know why but wouldn't work otherwise

However, the stat outputs for mapped reads from each are different for some reason. Same with singletons. Below is an example for sample1_sorted.bam

BAMTOOLS
Total reads:       29075080
Mapped reads:      28157243     (96.8432%)
Forward strand:    14993199     (51.5672%)
Reverse strand:    14081881     (48.4328%)
Failed QC:         0    (0%)
Duplicates:        0    (0%)
Paired-end reads:  29075080     (100%)
'Proper-pairs':    26900935     (92.5223%)
Both pairs mapped: 28148319     (96.8125%)
Read 1:            14538803
Read 2:            14536277
Singletons:        8924 (0.030693%)
Average insert size (absolute value): 878.48
Median insert size (absolute value): 230

SAMTOOLS
29075080 + 0 in total (QC-passed reads + QC-failed reads)
155590 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
28157243 + 0 mapped (96.84%:N/A)
28919490 + 0 paired in sequencing
14459745 + 0 read1
14459745 + 0 read2
26852874 + 0 properly paired (92.85%:N/A)
27993140 + 0 with itself and mate mapped
8513 + 0 singletons (0.03%:N/A)
1026542 + 0 with mate mapped to a different chr
308644 + 0 with mate mapped to a different chr (mapQ>=5)

Then I used sambamba to separate mapped and unmapped reads from my fastq files with the following commands:

# For all R1 and R2 reads do the following:

 # Get all unmapped reads 
sambamba view -f bam -F 'unmapped and mate_is_unmapped' -t 16 sample_sorted.bam
sambamba sort -t 16 -o sample_unmapped_sorted.bam

 # Get all mapped reads
sambamba view -f bam -F 'not (unmapped or mate_is_unmapped) and paired' -t 16 sample_sorted.bam
sambamba sort -t 16 -o sample_mapped_sorted.bam

I then used samtools view -c to see how many reads are in each of the sorted bamfiles to check that they equal the initial amount of reads (which should be 29075080 for this sample).

samtools view -c sample_mapped_sorted.bam
# output = 28148319
samtools view -c sample_unmapped_sorted.bam
# output = 909324

Total of mapped and unmapped reads in bamfiles = 29,057,643 which is 17,437 reads short. Furthermore, if I add the singletons from the bamtools stat file (8924 reads) and the samtools stat file (8513 reads), they equal 17,437. Why is it just not showing as 17,437 in both stat outputs? Why are there discrepancies in the two stat files, why would singletons from both stat files equal the amount of missing reads from my mapped and unmapped files, and why am I missing reads? Is this an issue of sambamba? Did I use an incorrect flag for sambamba?

Thank you for your help.

sambamba bamtools samtools • 2.8k views

ADD COMMENT • link 2.9 years ago by DNAngel ▴ 250

score 2 · Answer 1 · 2021-05-12

2

Entering edit mode

2.9 years ago

Jorge Amigo 14k

Adding the extra 'and paired' condition in the second sambamba view filters out a few reads, therefore you are not adding all possible reads but leaving out the 'not (unmapped or mate_is_unmapped) and not paired' ones. Try removing it.

ADD COMMENT • link 2.9 years ago by Jorge Amigo 14k

0

Entering edit mode

If I want to extract all mapped reads, can I just change it to 'paired and proper_pair' instead? If I leave it out, I will get all reads including unmapped but I need those as separate files. Or did you mean simply just remove the "and paired" part of the line?

ADD REPLY • link 2.9 years ago by DNAngel ▴ 250

1

Entering edit mode

What I meant was that the complement of 'unmapped and mate_is_unmapped' is not 'not (unmapped or mate_is_unmapped) and paired', therefore the sum of both read sets won't be equal to the total number of reads. If you just remove the 'and paired' part of the line you'd still be considering reads from the 'unmapped and mate_is_unmapped' set, as the complement of 'unmapped and mate_is_unmapped' is simply 'not (unmapped and mate_is_unmapped)'.

If you want mapped reads on one side and unmapped reads on other you can do it in a much simpler way:

samtools view -bF4 file.bam > file.mapped.bam
samtools view -bf4 file.bam > file.unmapped.bam

ADD REPLY • link 2.9 years ago by Jorge Amigo 14k

0

Entering edit mode

Thanks I did try that but samtools is slow, hence why I was opting to use sambamba. Thank you though for the explanation!

ADD REPLY • link 2.9 years ago by DNAngel ▴ 250

0

Entering edit mode

If speed is an issue you can use threads (-@ option) in order to improve samtools' performance:

samtools view -bF4 -@ 8 file.bam > file.mapped.bam
samtools view -bf4 -@ 8 file.bam > file.unmapped.bam