Question: samtools convert to bam, sort and index all gz.sam
bioguy24 wrote, 10 months ago:

The bash script below uses samtools with GNU parallel to convert all gz.sam files in a directory, sort them, and index them (I think). However, it runs for a long time and I am not sure the files are actually being indexed. Is there a better, more efficient way? Without the .gz it is much faster. I am using samtools 1.9. Thank you :).

cd "$dir"
x=$(ls -dq *.sam* | wc -l)
echo "Starting conversion of" $x "sam files on" $(date) >> "$logfile"
ls *.sam | parallel "samtools view -b -S {} | samtools sort - {.}"
echo "conversion of" $x "sam files complete and converted to sorted bam on" $(date) >> "$logfile"
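One likely issue: samtools 1.x no longer accepts the old output-prefix form (`samtools sort - {.}`), and nothing in the loop runs `samtools index`. A hedged sketch of the same loop with the current `-o` syntax (the `_sorted` output naming is illustrative) would be:

```shell
# Sketch (samtools 1.x syntax): sort reads SAM directly, -o names the
# output BAM (replacing the removed output-prefix argument), and index
# runs afterwards so a .bai is actually produced.
convert_all() {
  ls *.sam | parallel 'samtools sort -O BAM -o {.}_sorted.bam {} && samtools index {.}_sorted.bam'
}
```

Call it from the directory containing the SAM files, e.g. `convert_all`.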
samtools parallel • 834 views
modified 10 months ago by cmdcolin • written 10 months ago by bioguy24

It is unlikely there is a faster way (you are already using parallel and all the cores you have access to locally?), unless you have access to a large cluster with hundreds of CPUs and a really high-performance file system where you can start all jobs at the same time.

modified 10 months ago • written 10 months ago by genomax

samtools sort accepts SAM as input and can output BAM, so there is no need to run samtools view first. I don't know how much time this saves, though.

Furthermore, there is the -@ option for using multiple threads. Again, I don't know whether this saves much time; the bottleneck is the sorting itself.
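As a sketch of this advice (the thread count and `_sorted` output naming are illustrative; check `samtools sort --help` for your version), the view-then-sort pipeline collapses into a single sort call:

```shell
# Sketch: convert one SAM to a sorted, indexed BAM in one sort call.
# samtools sort reads SAM directly; -@ adds worker threads.
sort_and_index() {
  in="$1"
  out="${in%.sam}_sorted.bam"        # foo.sam -> foo_sorted.bam
  samtools sort -@ 4 -O BAM -o "$out" "$in"
  samtools index "$out"              # writes ${out}.bai
}
```

After `export -f sort_and_index`, this drops into the original loop as `ls *.sam | parallel -j 4 sort_and_index {}`.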

fin swimmer

written 10 months ago by finswimmer

From personal experience, sambamba runs faster. Since making the switch I haven't benchmarked against the latest samtools versions, so the two tools might perform similarly now. Instead of using every available core to run jobs simultaneously, giving each sort more memory will make it run a lot quicker. For instance, sorting with 8 cores and 32 GB of memory (4 GB/core) will very likely finish sooner than with 32 cores and 8 GB of total memory. Also, if you have multiple storage options (i.e. network-based vs instance-store in the cloud), point your temp directory at the fastest storage.
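A sketch of that setup (the job count, thread and memory figures, and the /mnt/fast-tmp path are illustrative assumptions, not benchmarks):

```shell
# Sketch: fewer concurrent jobs, more memory per sort, and temporary
# files on fast local storage (/mnt/fast-tmp is an illustrative path).
sambamba_sort_all() {
  ls *.sam | parallel -j 4 \
    "sambamba view -S -f bam {} | sambamba sort -t 8 -m 4G --tmpdir=/mnt/fast-tmp -o {.}_sorted.bam /dev/stdin"
}
```

sambamba sort writes the .bai index itself by default, so no separate index step is needed.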

modified 10 months ago • written 10 months ago by Eric Lim
ATpoint wrote, 10 months ago:

I find it much cleaner to put the code to be executed in parallel into a function rather than squeezing it into a one-liner. Here the command is short, but this helps once the parallelized command grows. So instead of a parallel one-liner, do:

function SortAndIndex {

  ## write a status message to stderr
  (>&2 paste -d " " <(echo '[INFO]' 'Sam2Bam for' $1 'started on') <(date))

  ## Samtools sort can read SAM files directly and outputs BAM, indexing on the fly:
  samtools sort -O BAM -l 5 -o /dev/stdout $1 | tee ${1%.sam}_sorted.bam | samtools index - ${1%.sam}_sorted.bam.bai

  ## message again to stderr:
  (>&2 paste -d " " <(echo '[INFO]' 'Sam2Bam for' $1 'ended on') <(date))

}; export -f SortAndIndex

## Now run in parallel, here running 4 jobs in parallel, make sure you have enough resources:
ls *.sam | parallel -j 4 "SortAndIndex {} 2>> {.}.log"

Assuming you have a file foo.sam, the script produces foo_sorted.bam, foo_sorted.bam.bai, and a log file foo.log containing the status messages plus all messages produced by the tools. The code above is of course a bit "heavy" for a simple sam2bam conversion, but I think you get the general idea.

modified 10 months ago • written 10 months ago by ATpoint
cmdcolin (United States) wrote, 10 months ago:

The "not sure if it is indexed" part of the question is probably unrelated to sorting; indexing only happens after you have sorted BAM files.

Other options for speedup

written 10 months ago by cmdcolin
Powered by Biostar version 2.3.0