Hi,
I am using samtools for variant calling. The issue I am having is that samtools mpileup seems to be taking a very long time. I am running samtools on a cluster and have a set how long samtool will run. The first time I tried this I gave it 7 days to run with 256 Gbs of RAM. The resulting file only had two chromosome of variant calling. I am working with about 40 samples and they range from 10-30 Gbs so I realize it will take some time. I am just worried I am doing something wrong here? I feel like it shouldn't take this long.
Here is the command I am using for samtool
samtools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF -C 50 --output-tags DP,DV,DPR,INFO/DPR,DP4,SP -u -f refgenome.fa sample1.bam sample40.bam > allsample.bcf
What do you think? Any suggestion on how to speed this up? I am using -C 50 because I used bwa.
OP is using a cluster so starting parallel jobs (instead of
parallel
) would be the way to go (within bounds of allocated job slots). Assuming this is a high-performance compute cluster I/O should not be an issue.Requesting one node with, say 64 threads, I do not see any problem in using all the threads in one job with
-j 63
leaving one thread for parallel itself. Did that for dozens of jobs.I am not a parallel user so I will defer to you. Do you use a job scheduler and if so how does parallel interact with scheduler? Ultimately it is the scheduler that manages running jobs and I assume every job/process parallel spawns shows up with an individual PID. Admins frown on running jobs outside of scheduler in most places.
I see your point. I always book an entire node for these jobs, so it cannot interfere with other user's processes. probably the best would be to talk to OPs admin as you suggested above.
It is a high-performance compute cluster
You can run samtools on each chromosome all at once? This wont give you any issues?
GNU parallel doesn't seem to bad. Seems like a valid option.
Don't try that on your cluster without checking with cluster admins. More than one job trying to run on a core may cause unintended consequences for job scheduler (I assume you are using one).
Oh our lovely cluster admin. He is always so much fun to talk to :/
I will talk to him and see what he says. I think it will cause an issue. I guess I could just try just submitting jobs by chromosome? Maybe take a bit but none is really using our cluster since it's the summer here. Then just combine all the files into one VCF? Do you think that would work?