Taking over someone else's code, I've discovered a bottleneck. This is a portion of a script that accepts a BAM and outputs smaller, sorted, rmdup-ed, indexed BAMs for each chromosome.
for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M; do
    samtools view -@ "$threads" -b "$bam" "$chr" > "$bam.$chr.bam"
    samtools index "$bam.$chr.bam"
    samtools rmdup "$bam.$chr.bam" "$bam.$chr.rmdup.bam"
    samtools index "$bam.$chr.rmdup.bam"
done
After reading this Biostars question, I replaced the loop and the samtools view call with:
bamtools split -in $bam -reference
Though it can't be multithreaded, this does seem, at least in my benchmarking so far, to be faster. I'm wondering, though, whether the BAMs it produces are guaranteed to be sorted and rmdup-ed if the input $bam was? Or would I still need to run samtools rmdup (or samtools sort) on each child BAM?
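For the sorting half of the question, one quick sanity check I can think of is to compare the sort order declared in each child BAM's header with the parent's. Below is a minimal sketch of that check. The helper name sort_order_from_header is hypothetical (mine, not part of samtools or bamtools), and the .REF_ output naming in the usage comment is my assumption about how bamtools split names its files. It only reports what the @HD line claims, so it does not prove the records are actually in order, and it says nothing about duplicates.

```shell
# Hypothetical helper: print the SO (sort order) value from a SAM header
# read on stdin. Prints "unknown" if there is no @HD line or no SO tag.
sort_order_from_header() {
    awk -F'\t' '
        /^@HD/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit }
            }
        }
        END { if (!found) print "unknown" }
    '
}

# Usage sketch (assumes child files are named "$bam".REF_<name>.bam):
#   for f in "$bam".REF_*.bam; do
#       printf '%s\t%s\n' "$f" "$(samtools view -H "$f" | sort_order_from_header)"
#   done
```

If a child reports something other than "coordinate" while the parent reports "coordinate", that would at least tell me the header was not carried over as-is.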