Question: samtools convert to bam, sort and index all gz.sam
bioguy24 wrote, 10 months ago:

The bash script below uses samtools with GNU parallel to convert all gz.sam files in a directory, sort them, and index them (I think). However, it runs for a long time and I am not sure the files are actually being indexed. Is there a better, more efficient way? Without the .gz it is much faster. I am using samtools 1.9. Thank you :).

cd "$dir"
x=$(ls -dq *.sam* | wc -l)
echo "Starting conversion of" $x "sam files on" $(date) >> "$logfile"
ls *.sam | parallel "samtools view -b -S {} | samtools sort - {.}"
echo "conversion of" $x "sam files complete and converted to sorted bam on" $(date) >> "$logfile"
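One likely issue: samtools 1.x no longer accepts the old output-prefix form (`samtools sort - {.}`), and nothing in the loop runs `samtools index`. A hedged sketch of the same loop with the current `-o` syntax (the `_sorted` output naming is illustrative) would be:

```shell
# Sketch (samtools 1.x syntax): sort reads SAM directly, -o names the
# output BAM (replacing the removed output-prefix argument), and index
# runs afterwards so a .bai is actually produced.
convert_all() {
  ls *.sam | parallel 'samtools sort -O BAM -o {.}_sorted.bam {} && samtools index {.}_sorted.bam'
}
```

Call it from the directory containing the SAM files, e.g. `convert_all`.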
samtools parallel • 834 views
modified 10 months ago by cmdcolin • written 10 months ago by bioguy24

It is unlikely there is a faster way (you are already using parallel and all the cores you have access to locally?), unless you have access to a large cluster with hundreds of CPUs and a really high-performance file system where you can start all jobs at the same time.

modified 10 months ago • written 10 months ago by genomax

samtools sort accepts SAM as input and can output BAM, so there is no need to run samtools view first. I don't know how much time this saves, though.

Furthermore, there is the -@ option for using multiple threads. Again, I don't know whether this saves much time; the bottleneck is the sorting itself.
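As a sketch of this advice (the thread count and `_sorted` output naming are illustrative; check `samtools sort --help` for your version), the view-then-sort pipeline collapses into a single sort call:

```shell
# Sketch: convert one SAM to a sorted, indexed BAM in one sort call.
# samtools sort reads SAM directly; -@ adds worker threads.
sort_and_index() {
  in="$1"
  out="${in%.sam}_sorted.bam"        # foo.sam -> foo_sorted.bam
  samtools sort -@ 4 -O BAM -o "$out" "$in"
  samtools index "$out"              # writes ${out}.bai
}
```

After `export -f sort_and_index`, this drops into the original loop as `ls *.sam | parallel -j 4 sort_and_index {}`.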

fin swimmer

written 10 months ago by finswimmer

From personal experience, sambamba runs faster. Since making the switch I haven't benchmarked against the latest samtools versions, so the two tools might perform similarly now. Instead of using every available core to run jobs simultaneously, giving each sort more memory will make it run a lot quicker. For instance, sorting with 8 cores and 32 GB of memory (4 GB/core) will very likely finish sooner than with 32 cores and 8 GB of total memory. Also, if you have multiple storage options (i.e. network-based vs instance-store in the cloud), point your temp directory at the fastest storage.
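A sketch of that setup (the job count, thread and memory figures, and the /mnt/fast-tmp path are illustrative assumptions, not benchmarks):

```shell
# Sketch: fewer concurrent jobs, more memory per sort, and temporary
# files on fast local storage (/mnt/fast-tmp is an illustrative path).
sambamba_sort_all() {
  ls *.sam | parallel -j 4 \
    "sambamba view -S -f bam {} | sambamba sort -t 8 -m 4G --tmpdir=/mnt/fast-tmp -o {.}_sorted.bam /dev/stdin"
}
```

sambamba sort writes the .bai index itself by default, so no separate index step is needed.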

modified 10 months ago • written 10 months ago by Eric Lim
ATpoint wrote, 10 months ago:

I find it much cleaner to put the code to be executed in parallel into a function rather than squeezing it into a one-liner. Here the command is short, but this helps once the parallelized command grows. So instead of a parallel one-liner, do:

function SortAndIndex {

  ## write a status message to stderr
  (>&2 paste -d " " <(echo '[INFO]' 'Sam2Bam for' $1 'started on') <(date))

  ## Samtools sort can read SAM files directly and outputs BAM, indexing on the fly:
  samtools sort -O BAM -l 5 -o /dev/stdout $1 | tee ${1%.sam}_sorted.bam | samtools index - ${1%.sam}_sorted.bam.bai

  ## message again to stderr:
  (>&2 paste -d " " <(echo '[INFO]' 'Sam2Bam for' $1 'ended on') <(date))

}; export -f SortAndIndex

## Now run in parallel, here running 4 jobs in parallel, make sure you have enough resources:
ls *.sam | parallel -j 4 "SortAndIndex {} 2>> {.}.log"

Assuming you have a file foo.sam, the script produces foo_sorted.bam, foo_sorted.bam.bai, and a log file foo.log containing the status messages plus all messages produced by the tools. The code above is of course a bit "heavy" for a simple sam2bam conversion, but I think you get the general idea.

modified 10 months ago • written 10 months ago by ATpoint
cmdcolin (United States) wrote, 10 months ago:

The "not sure if it is indexed" part of the question is probably unrelated to sorting; indexing only happens after you have sorted BAM files.

Other options for speedup

written 10 months ago by cmdcolin
Powered by Biostar version 2.3.0