Question

Samtools mpileup taking a long time to finish

0

Entering edit mode

5.9 years ago

williamsbrian5064 ▴ 510

Hi,

I am using samtools for variant calling. The issue I am having is that samtools mpileup seems to be taking a very long time. I am running samtools on a cluster and have a set how long samtool will run. The first time I tried this I gave it 7 days to run with 256 Gbs of RAM. The resulting file only had two chromosome of variant calling. I am working with about 40 samples and they range from 10-30 Gbs so I realize it will take some time. I am just worried I am doing something wrong here? I feel like it shouldn't take this long.

Here is the command I am using for samtool

samtools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF -C 50 --output-tags DP,DV,DPR,INFO/DPR,DP4,SP -u -f refgenome.fa sample1.bam sample40.bam > allsample.bcf

What do you think? Any suggestion on how to speed this up? I am using -C 50 because I used bwa.

Assembly alignment sequence genome • 5.4k views

ADD COMMENT • link updated 5.9 years ago by ATpoint 82k • written 5.9 years ago by williamsbrian5064 ▴ 510

score 2 · Accepted Answer · 2018-06-12

2

Entering edit mode

5.9 years ago

ATpoint 82k

Mpileup is not a speedy guy, that is true. You should check with top if it is running smoothly or if any I/O bottleneck kicks in. A second option is to use something like GNU parallel to call variants for each chromosome in parallel:

You'll need a BED file with all chromosomes for your reference genome like: chr1 0 242000000 chr2 0 240000000

Then use parallel with something like:

cat chromosomes.bed | awk 'FS="\t", OFS="" {print $1,":"$2"-"$3}' | parallel "samtools mpileup (....) -r {} sample1.bam ... sample40.bam > {}.vcf"

Passing -j <int> to parallel allows to control the number of parallel actions, if parallelizing over all chromosomes at the same time gives I/O problems.

Output files would then be, e.g. chr1:0-242000000.vcf, chr2:0-200000000.vcf. Admittedly pretty ugly, but after samtools has finished, simply cat everything:

cat chr*.vcf > catted.vcf

ADD COMMENT • link 5.9 years ago by ATpoint 82k

1

Entering edit mode

OP is using a cluster so starting parallel jobs (instead of parallel) would be the way to go (within bounds of allocated job slots). Assuming this is a high-performance compute cluster I/O should not be an issue.

ADD REPLY • link 5.9 years ago by GenoMax 142k

1

Entering edit mode

Requesting one node with, say 64 threads, I do not see any problem in using all the threads in one job with -j 63 leaving one thread for parallel itself. Did that for dozens of jobs.

ADD REPLY • link 5.9 years ago by ATpoint 82k

0

Entering edit mode

I am not a parallel user so I will defer to you. Do you use a job scheduler and if so how does parallel interact with scheduler? Ultimately it is the scheduler that manages running jobs and I assume every job/process parallel spawns shows up with an individual PID. Admins frown on running jobs outside of scheduler in most places.

ADD REPLY • link 5.9 years ago by GenoMax 142k

0

Entering edit mode

I see your point. I always book an entire node for these jobs, so it cannot interfere with other user's processes. probably the best would be to talk to OPs admin as you suggested above.

ADD REPLY • link 5.9 years ago by ATpoint 82k

0

Entering edit mode

It is a high-performance compute cluster

ADD REPLY • link 5.9 years ago by williamsbrian5064 ▴ 510

0

Entering edit mode

You can run samtools on each chromosome all at once? This wont give you any issues?

GNU parallel doesn't seem to bad. Seems like a valid option.

ADD REPLY • link 5.9 years ago by williamsbrian5064 ▴ 510

1

Entering edit mode

Don't try that on your cluster without checking with cluster admins. More than one job trying to run on a core may cause unintended consequences for job scheduler (I assume you are using one).

ADD REPLY • link 5.9 years ago by GenoMax 142k

0

Entering edit mode

Oh our lovely cluster admin. He is always so much fun to talk to :/

I will talk to him and see what he says. I think it will cause an issue. I guess I could just try just submitting jobs by chromosome? Maybe take a bit but none is really using our cluster since it's the summer here. Then just combine all the files into one VCF? Do you think that would work?

ADD REPLY • link 5.9 years ago by williamsbrian5064 ▴ 510