samtools (conversion from .sam to .bam) and sorting of .bam file
7.2 years ago
ravi.uhdnis ▴ 200

Hi Everyone,

I have WGS data for 5 human samples from an Illumina HiSeq 2500 platform, paired-end sequencing with a read length of 101 bp. The average size of one sample's .sam file is 260 GB, and I ran the following command to convert .sam to .bam format:

/usr/local/samtools-1.2/samtools view -b  ETH002102.bwa.sam > ETH002102.bwa.sam.bam

Is there any way to speed up this conversion? It normally takes approx. 18-20 hours.

My 2nd question is related to 'SORTING' of the .bam file, for which I ran the following command:

/usr/local/samtools-1.2/samtools sort -@ 14 -m 5000000000 -T /tmp/ETH002102.bwa.sam.sort -o ETH002102.bwa.sam.sort.bam ETH002102.bwa.sam.bam

I also tried '-m 5G', but I am not able to get the output. The command runs for a few minutes and is then automatically killed by the cluster. I am not sure where the error is, so it would be great if somebody could help me with the correct command to run on the cluster across multiple cores of a node (here I gave 14, as I had 16), i.e. in parallel mode. Since I am new to this area and doing such work for the first time, any help will be highly appreciated. Thanks and regards, Ravi

next-gen genome alignment sequencing • 25k views

Hi, the -m option indicates the maximum required memory per thread. By running the job on 14 threads you are probably running out of memory, and this causes the crash. Try rerunning with fewer threads and default/less memory.

7.2 years ago
1. Avoid writing the unsorted BAM file to disk: samtools view -u alignment.sam | samtools sort -@ 4 - output_prefix
2. The -m option given to samtools sort should be considered approximate at best. It's probably best to assume that samtools will actually use ~2.5x that per core. So, you can expect this to use ~175 GB of RAM. Are you able to allocate that much?
3. Realistically speaking, this is likely to be an IO bound process. Yes, throwing a few more cores at it will speed things up a bit, but you'll quickly run into diminishing returns.
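To make the arithmetic in point 2 concrete, here is a quick back-of-the-envelope check. The ~2.5x factor is an observed rule of thumb for samtools sort's overhead, not a documented guarantee:

```shell
# Estimate peak RAM for samtools sort: threads x (-m per thread) x ~2.5x observed overhead
threads=14
mem_per_thread_gb=5
peak_gb=$(awk -v t="$threads" -v m="$mem_per_thread_gb" 'BEGIN{printf "%.0f", t*m*2.5}')
echo "estimated peak: ~${peak_gb} GB"
```

With 14 threads at 5 GB each, that lands at roughly 175 GB, far beyond a typical compute node.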

BTW, you could also use the equivalent tool from Picard. The SortSam function will produce a sorted BAM file from a SAM file and write the index in a single go.
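For reference, a Picard SortSam invocation might look something like the following. This is a sketch, not a drop-in: the jar path, heap size, and file names are placeholders, and older Picard releases ship a standalone SortSam.jar instead of the combined picard.jar:

```shell
# Sketch: sort a SAM into a coordinate-sorted, indexed BAM in one pass.
# picard.jar location and -Xmx heap size are assumptions; adjust for your install.
java -Xmx8g -jar picard.jar SortSam \
    I=ETH002102.bwa.sam \
    O=ETH002102.sorted.bam \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true
```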


As an alternative to Picard, you could also try out this fork of samtools, which dramatically speeds up the sorting by outsourcing it to RocksDB. I use this routinely in place of the normal samtools sort.


Hi, thank you for the comment. It would be great if you could elaborate a bit on how to use this fork of samtools. Does one have to start by installing 'samtools-dnanexus.zip'? Thanks


That fork looks great!


In combination with some fast SSD scratch space (or if RAM allows, a RAMdisk on tmpfs) it rocks my socks :D


samtools sort doesn't work with these arguments; it just prints the help/usage message.


It needs to be changed a bit:

samtools view -u file.sam | samtools sort -o file_sorted


samtools sort does not run with these settings and prints the usage instructions. It is probably an issue with the piping... not sure if it has become a recent issue.


Hi Devon, thank you for your comment. Here is the output showing the computational power available to us (using qhost):

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  lx24-amd64      8  0.00   47.2G    7.3G   96.0G  280.0K
node1                   lx24-amd64      8  0.00   47.2G  124.5M     0.0     0.0
node10                  lx24-amd64     16  1.00   47.2G   16.8G     0.0     0.0
node11                  lx24-amd64      8  0.00   47.2G    2.8G     0.0     0.0
node12                  lx24-amd64      8  0.00   47.2G    6.5G     0.0     0.0
node13                  lx24-amd64      8  0.00   47.2G    6.5G     0.0     0.0
node14                  lx24-amd64      8  0.00   47.2G    2.8G     0.0     0.0
node15                  lx24-amd64      8  0.00   47.2G  121.0M     0.0     0.0
node2                   lx24-amd64      8  0.00   47.2G  122.9M     0.0     0.0
node3                   lx24-amd64      8  0.00   47.2G  122.2M     0.0     0.0
node4                   lx24-amd64      8  0.00   47.2G  121.2M     0.0     0.0
node5                   lx24-amd64      8  0.00   47.2G  120.9M     0.0     0.0
node6                   lx24-amd64      8  0.00   47.2G  120.8M     0.0     0.0
node7                   lx24-amd64      8  0.00   47.2G  121.8M     0.0     0.0
node8                   lx24-amd64      8  0.00   47.2G  121.2M     0.0     0.0
node9                   lx24-amd64      8  0.00   47.2G  121.2M     0.0     0.0

Each processor [Intel(R) Xeon(R) CPU X5550 @ 2.67GHz] has 4 cores. Therefore, I believe this much computational power is sufficient to handle the WGS data analysis of human samples (currently we have 5, with many more coming in the incoming days). Presently I am running a single command on a single node, e.g. on node8 I tried the following command:

/usr/local/samtools-1.2/samtools sort -@ 4 -m 5G -T /tmp/ETH002102.bwa.sam.sort -o ETH002102.bwa.sam.sort.bam ETH002102.bwa.sam.bam

with the -m option (ranging over 2, 2.5, 3, 4), but it didn't work out. I have 5 such samples with .bam files (65 GB to 75 GB in size), but I'm unable to sort these files using samtools. Once I finalize all my working commands, my target will be to build an automated pipeline for this work. Going to use Picard now... any suggestion will be highly appreciated. Thanks and regards, Ravi

7.2 years ago
jshall ▴ 40

You might be able to reduce that by using fast compression (samtools view -1) and/or using multiple compression threads with -@. However, it's quite likely that disk IO will be the limiting factor with a 260 GB SAM file, so don't expect huge gains.

How much RAM do you have available on your cluster? If you're using some kind of queuing system you may be able to request more than the default, but if it's a hardware limit you're stuck.

Also, as Nicola says, that 5G is per thread. Generally with sorting you see diminishing returns with increasing threads; I did some benchmarking a while back with samtools sort, and gains basically stalled out at 8 threads (and 8 was only about 30% faster than 4). So you can probably reduce memory usage just by reducing the number of threads. One caveat I have noticed, though, is that samtools sometimes peaks above the amount of RAM it's supposed to be using, which has gotten my jobs killed before.
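To put rough numbers on this, here is a sketch of estimated peak RAM at -m 5G across a few thread counts, assuming samtools sort really uses ~2.5x the -m value per thread (a rule of thumb, not a documented figure):

```shell
# Estimated peak RAM (GB) for samtools sort -m 5G at various thread counts,
# assuming ~2.5x per-thread overhead (an observed rule of thumb).
out=$(for t in 2 4 8 14; do
  awk -v t="$t" 'BEGIN{printf "%d threads: ~%.0f GB\n", t, t*5*2.5}'
done)
echo "$out"
```

Note that by this estimate even -@ 4 -m 5G (~50 GB) exceeds the 47.2 GB nodes shown in the qhost output, which would explain jobs getting killed there too; dropping -m to 1G or 2G would bring 4 threads comfortably under the node limit.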

Finally, if you're always going to be converting from SAM to BAM and then sorting, you can pipe the output of view directly into sort, which could potentially reduce IO bottlenecks. If you're doing this it makes sense to give view the -u flag, so that it does not compress the BAM stream, i.e. something like 'samtools view -u example.sam | samtools sort -@ 8 -m 5G -T tmp -o out.bam'


Hi, thank you for your comment. I just explained above the computational power accessible to us, which I think is good enough to handle the jobs I am working on right now. Presently I am running a single command on a node to get the output; once I reach the final output, my objective will be to develop a pipeline to do all this work in a systematic manner using a shell script.

Moreover, I tried your suggestion of converting .sam to sorted .bam (bypassing the intermediate .bam file using a pipe), but there too I didn't get the required output. The command ran but subsequently got killed after some time... I am sure I am missing some factor in optimizing the correct assignment of threads and memory. I tried -m option values of 2, 2.5, 3, 4 and 5, though it didn't work out for me. Looking for an alternative, going to try Picard now. Any suggestion will be valuable... thanks and regards, Ravi


I strongly suspect that your issue is that you aren't requesting enough memory from your cluster's queueing system. Most shared-use clusters limit the amount of RAM they allocate by default; you have to actively request to go over that limit in your qsub/qlogin/qsh/whatever command. Unfortunately, there are a bunch of different ways this can be configured. Does your cluster have a user guide or a sysadmin you could consult?
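On an SGE-style cluster like the one in the qhost output above, a submission with an explicit memory request might look something like this. It is only a sketch: the parallel environment name 'smp', the h_vmem limit, and the script name are site-specific assumptions, so check with your sysadmin:

```shell
# Hypothetical SGE submission: 4 slots, 10 GB virtual memory per slot (~40 GB total,
# since h_vmem is multiplied by the slot count). 'smp' and sort_job.sh are placeholders.
# Inside sort_job.sh, keep -m modest to fit, e.g.: samtools sort -@ 4 -m 2G ...
qsub -pe smp 4 -l h_vmem=10G -cwd sort_job.sh
```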