Question: Samtools mpileup taking a long time to finish
0
gravatar for williamsbrian5064
5 weeks ago by
williamsbrian506490 wrote:

Hi,

I am using samtools for variant calling. The issue I am having is that samtools mpileup seems to be taking a very long time. I am running samtools on a cluster and have a set how long samtool will run. The first time I tried this I gave it 7 days to run with 256 Gbs of RAM. The resulting file only had two chromosome of variant calling. I am working with about 40 samples and they range from 10-30 Gbs so I realize it will take some time. I am just worried I am doing something wrong here? I feel like it shouldn't take this long.

Here is the command I am using for samtool

samtools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF -C 50 --output-tags DP,DV,DPR,INFO/DPR,DP4,SP -u -f refgenome.fa sample1.bam sample40.bam > allsample.bcf

What do you think? Any suggestion on how to speed this up? I am using -C 50 because I used bwa.

ADD COMMENTlink modified 5 weeks ago by ATpoint5.7k • written 5 weeks ago by williamsbrian506490
2
gravatar for ATpoint
5 weeks ago by
ATpoint5.7k
Germany
ATpoint5.7k wrote:

Mpileup is not a speedy guy, that is true. You should check with top if it is running smoothly or if any I/O bottleneck kicks in. A second option is to use something like GNU parallel to call variants for each chromosome in parallel:

You'll need a BED file with all chromosomes for your reference genome like: chr1 0 242000000 chr2 0 240000000

Then use parallel with something like:

cat chromosomes.bed | awk 'FS="\t", OFS="" {print $1,":"$2"-"$3}' | parallel "samtools mpileup (....) -r {} sample1.bam ... sample40.bam > {}.vcf"

Passing -j <int> to parallel allows to control the number of parallel actions, if parallelizing over all chromosomes at the same time gives I/O problems.

Output files would then be, e.g. chr1:0-242000000.vcf, chr2:0-200000000.vcf. Admittedly pretty ugly, but after samtools has finished, simply cat everything:

cat chr*.vcf > catted.vcf
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by ATpoint5.7k
1

OP is using a cluster so starting parallel jobs (instead of parallel) would be the way to go (within bounds of allocated job slots). Assuming this is a high-performance compute cluster I/O should not be an issue.

ADD REPLYlink written 5 weeks ago by genomax52k
1

Requesting one node with, say 64 threads, I do not see any problem in using all the threads in one job with -j 63 leaving one thread for parallel itself. Did that for dozens of jobs.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint5.7k

I am not a parallel user so I will defer to you. Do you use a job scheduler and if so how does parallel interact with scheduler? Ultimately it is the scheduler that manages running jobs and I assume every job/process parallel spawns shows up with an individual PID. Admins frown on running jobs outside of scheduler in most places.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by genomax52k

I see your point. I always book an entire node for these jobs, so it cannot interfere with other user's processes. probably the best would be to talk to OPs admin as you suggested above.

ADD REPLYlink written 5 weeks ago by ATpoint5.7k

It is a high-performance compute cluster

ADD REPLYlink written 5 weeks ago by williamsbrian506490

You can run samtools on each chromosome all at once? This wont give you any issues?

GNU parallel doesn't seem to bad. Seems like a valid option.

ADD REPLYlink written 5 weeks ago by williamsbrian506490
1

Don't try that on your cluster without checking with cluster admins. More than one job trying to run on a core may cause unintended consequences for job scheduler (I assume you are using one).

ADD REPLYlink written 5 weeks ago by genomax52k

Oh our lovely cluster admin. He is always so much fun to talk to :/

I will talk to him and see what he says. I think it will cause an issue. I guess I could just try just submitting jobs by chromosome? Maybe take a bit but none is really using our cluster since it's the summer here. Then just combine all the files into one VCF? Do you think that would work?

ADD REPLYlink written 5 weeks ago by williamsbrian506490
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 660 users visited in the last hour