Question: Samtools mpileup taking a long time to finish
0
gravatar for williamsbrian5064
12 months ago by
williamsbrian5064170 wrote:

Hi,

I am using samtools for variant calling. The issue I am having is that samtools mpileup seems to be taking a very long time. I am running samtools on a cluster and have a set how long samtool will run. The first time I tried this I gave it 7 days to run with 256 Gbs of RAM. The resulting file only had two chromosome of variant calling. I am working with about 40 samples and they range from 10-30 Gbs so I realize it will take some time. I am just worried I am doing something wrong here? I feel like it shouldn't take this long.

Here is the command I am using for samtool

samtools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF -C 50 --output-tags DP,DV,DPR,INFO/DPR,DP4,SP -u -f refgenome.fa sample1.bam sample40.bam > allsample.bcf

What do you think? Any suggestion on how to speed this up? I am using -C 50 because I used bwa.

ADD COMMENTlink modified 12 months ago by ATpoint17k • written 12 months ago by williamsbrian5064170
2
gravatar for ATpoint
12 months ago by
ATpoint17k
Germany
ATpoint17k wrote:

Mpileup is not a speedy guy, that is true. You should check with top if it is running smoothly or if any I/O bottleneck kicks in. A second option is to use something like GNU parallel to call variants for each chromosome in parallel:

You'll need a BED file with all chromosomes for your reference genome like: chr1 0 242000000 chr2 0 240000000

Then use parallel with something like:

cat chromosomes.bed | awk 'FS="\t", OFS="" {print $1,":"$2"-"$3}' | parallel "samtools mpileup (....) -r {} sample1.bam ... sample40.bam > {}.vcf"

Passing -j <int> to parallel allows to control the number of parallel actions, if parallelizing over all chromosomes at the same time gives I/O problems.

Output files would then be, e.g. chr1:0-242000000.vcf, chr2:0-200000000.vcf. Admittedly pretty ugly, but after samtools has finished, simply cat everything:

cat chr*.vcf > catted.vcf
ADD COMMENTlink modified 12 months ago • written 12 months ago by ATpoint17k
1

OP is using a cluster so starting parallel jobs (instead of parallel) would be the way to go (within bounds of allocated job slots). Assuming this is a high-performance compute cluster I/O should not be an issue.

ADD REPLYlink written 12 months ago by genomax68k
1

Requesting one node with, say 64 threads, I do not see any problem in using all the threads in one job with -j 63 leaving one thread for parallel itself. Did that for dozens of jobs.

ADD REPLYlink modified 12 months ago • written 12 months ago by ATpoint17k

I am not a parallel user so I will defer to you. Do you use a job scheduler and if so how does parallel interact with scheduler? Ultimately it is the scheduler that manages running jobs and I assume every job/process parallel spawns shows up with an individual PID. Admins frown on running jobs outside of scheduler in most places.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax68k

I see your point. I always book an entire node for these jobs, so it cannot interfere with other user's processes. probably the best would be to talk to OPs admin as you suggested above.

ADD REPLYlink written 12 months ago by ATpoint17k

It is a high-performance compute cluster

ADD REPLYlink written 12 months ago by williamsbrian5064170

You can run samtools on each chromosome all at once? This wont give you any issues?

GNU parallel doesn't seem to bad. Seems like a valid option.

ADD REPLYlink written 12 months ago by williamsbrian5064170
1

Don't try that on your cluster without checking with cluster admins. More than one job trying to run on a core may cause unintended consequences for job scheduler (I assume you are using one).

ADD REPLYlink written 12 months ago by genomax68k

Oh our lovely cluster admin. He is always so much fun to talk to :/

I will talk to him and see what he says. I think it will cause an issue. I guess I could just try just submitting jobs by chromosome? Maybe take a bit but none is really using our cluster since it's the summer here. Then just combine all the files into one VCF? Do you think that would work?

ADD REPLYlink written 12 months ago by williamsbrian5064170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 831 users visited in the last hour