A bit of a confusing one here. First I mapped my fastq reads using BWA, converted my sam files to bam, sorted and indexed them. Next, I obtained some stats and I wanted to use two different programs so I ran the following:
bamtools stats -in sample_sorted.bam -insert > sample_sorted_bamtools.stats.txt samtools flagstat -t 16 "sample_sorted.bam" > "sample_sorted_samtools.stats.txt" # quotes needed for this, don't know why but wouldn't work otherwise
However, the stat outputs for mapped reads from each are different for some reason. Same with singletons. Below is an example for sample1_sorted.bam
BAMTOOLS Total reads: 29075080 Mapped reads: 28157243 (96.8432%) Forward strand: 14993199 (51.5672%) Reverse strand: 14081881 (48.4328%) Failed QC: 0 (0%) Duplicates: 0 (0%) Paired-end reads: 29075080 (100%) 'Proper-pairs': 26900935 (92.5223%) Both pairs mapped: 28148319 (96.8125%) Read 1: 14538803 Read 2: 14536277 Singletons: 8924 (0.030693%) Average insert size (absolute value): 878.48 Median insert size (absolute value): 230 SAMTOOLS 29075080 + 0 in total (QC-passed reads + QC-failed reads) 155590 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 28157243 + 0 mapped (96.84%:N/A) 28919490 + 0 paired in sequencing 14459745 + 0 read1 14459745 + 0 read2 26852874 + 0 properly paired (92.85%:N/A) 27993140 + 0 with itself and mate mapped 8513 + 0 singletons (0.03%:N/A) 1026542 + 0 with mate mapped to a different chr 308644 + 0 with mate mapped to a different chr (mapQ>=5)
Then I used sambamba to separate mapped and unmapped reads from my fastq files with the following commands:
# For all R1 and R2 reads do the following: # Get all unmapped reads sambamba view -f bam -F 'unmapped and mate_is_unmapped' -t 16 sample_sorted.bam sambamba sort -t 16 -o sample_unmapped_sorted.bam # Get all mapped reads sambamba view -f bam -F 'not (unmapped or mate_is_unmapped) and paired' -t 16 sample_sorted.bam sambamba sort -t 16 -o sample_mapped_sorted.bam
I then used samtools view -c to see how many reads are in each of the sorted bamfiles to check that they equal the initial amount of reads (which should be 29075080 for this sample).
samtools view -c sample_mapped_sorted.bam # output = 28148319 samtools view -c sample_unmapped_sorted.bam # output = 909324
Total of mapped and unmapped reads in bamfiles = 29,057,643 which is 17,437 reads short. Furthermore, if I add the singletons from the bamtools stat file (8924 reads) and the samtools stat file (8513 reads), they equal 17,437. Why is it just not showing as 17,437 in both stat outputs? Why are there discrepancies in the two stat files, why would singletons from both stat files equal the amount of missing reads from my mapped and unmapped files, and why am I missing reads? Is this an issue of sambamba? Did I use an incorrect flag for sambamba?
Thank you for your help.
If I want to extract all mapped reads, can I just change it to 'paired and proper_pair' instead? If I leave it out, I will get all reads including unmapped but I need those as separate files. Or did you mean simply just remove the "and paired" part of the line?
What I meant was that the complement of 'unmapped and mate_is_unmapped' is not 'not (unmapped or mate_is_unmapped) and paired', therefore the sum of both read sets won't be equal to the total number of reads. If you just remove the 'and paired' part of the line you'd still be considering reads from the 'unmapped and mate_is_unmapped' set, as the complement of 'unmapped and mate_is_unmapped' is simply 'not (unmapped and mate_is_unmapped)'.
If you want mapped reads on one side and unmapped reads on other you can do it in a much simpler way:
Thanks I did try that but samtools is slow, hence why I was opting to use sambamba. Thank you though for the explanation!
If speed is an issue you can use threads (
-@option) in order to improve samtools' performance:
I tried it before with samtools -@, still wayyyy too slow for some reason.