Question: How to make variant calling run faster?
gravatar for steve
3.3 years ago by
United States
steve2.7k wrote:

We are doing a lot of variant calling using GATK MuTect2. Evidently, multi-threading does not work in this program, so we have to run it single threaded. It takes a very long time, ~24-36 hours to finish. Anyone know how to speed it up?

I had an idea of splitting the .bam into separate files per-chromosome and then running them all separately in parrallel, but I was not sure how difficult it would be and how hard it would be to merge everything back together into a single .vcf output at the end. Any thoughts?

mutect2 gatk variant calling • 3.9k views
ADD COMMENTlink modified 3.3 years ago by Brian Bushnell17k • written 3.3 years ago by steve2.7k

looks like my second part there is already mentioned here

ADD REPLYlink written 3.3 years ago by steve2.7k
gravatar for jared.andrews07
3.3 years ago by
Memphis, TN
jared.andrews078.3k wrote:

Your second option is the right one. Breaking up BAMs by chromosome is easy as seen in your linked question. Concatenating the VCFs is also pretty easy with bcftools. Just look for the concat command.

ADD COMMENTlink written 3.3 years ago by jared.andrews078.3k

I ended up doing the reverse, and splitting my .bed file with target regions into one .bed per chromosome, and then running the same original .bam files against each of the split .bed files in parrallel. So far it has reduced 30+ hour variant calling down to ~4 hours as per the longest-running chromosome (most finished <2 hours). Helpful details here

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by steve2.7k
gravatar for pfs
3.3 years ago by
pfs280 wrote:

If speed is your main concern and tool development is not an option I would change my pipeline to use Samtools for variant calling. I have benchmark variant calling tools in the past and Samtools is a lot faster than GATK haplotypecaller v3.5. I can not speak for newer versions of GATK.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by pfs280

It's also a lot less accurate by many other benchmarks. I would not recommend samtools for variant calling, particularly if you care about indels or low frequency variants. It will not detect variants with a VAF <0.20. It definitely is quicker than GATK, but it's not as sensitive or accurate.

See this paper for benchmarks of many different callers (though there are many other papers out there like this as well).

ADD REPLYlink written 3.3 years ago by jared.andrews078.3k

That is absolutely not true, based on my own benchmarking. SAMtools (BCFtools mpileup piped into BCFtools call) can easily detect low frequency variants.

but it's not as sensitive or accurate.

You have not qualified this statement at all, Jared. From my own work in the clinical setting (and work of others), SAMtools has higher sensitivity to Sanger sequencing compared to GATK's tools.

ADD REPLYlink modified 21 months ago • written 21 months ago by Kevin Blighe69k

I appreciate you qualifying your statements with hard facts and not anecdotal evidence. Here are some other publications supporting GATK generally having higher sensitivity, precision, and recall than SAMtools.

You're correct in stating that variant calling is a complex task with many variables (sequencing platform, aligner, literally dozens of parameters for most callers, etc), but every comparison I find (and my own experience) has shown that SAMtools isn't great at recalling low frequency SNVs (especially with low read depth) or indels. There is a reason that most of the top entries in the FDA Precision Medicine Variant Calling Challenges used GATK. Variant calling has changed drastically over the last few years, so it's possible newer versions of htslib/SAMtools/BCFtools may be better - I can't say. But most side-by-side comparisons pretty clearly indicate that GATK is just objectively better in most cases. Of course, the differences are usually a few percent at most, so it's not like SAMtools is atrocious or anything. SAMtools may still be the best tool for the job in some cases - it's really up to users to decide. But in general, I'm not going to withhold recommending a great tool just because another works "good enough".

I am glad that SAMtools has worked well for you in the clinical setting, which is an area that deserves a comparison on its own. I would be interested in seeing any reports from clinical laboratories backed by Sanger, your own benchmarking work, or any studies that indicate the superiority of SAMtools over GATK for general variant/small indel calling. I have no loyalty to either, so I'm more than happy to be proven wrong.

ADD REPLYlink written 21 months ago by jared.andrews078.3k

In addition, the problem with reading a single instantiated report, like the Scientific Reports manuscript that you've cited, is that there are are literally hundreds of parameters going into each variant calling pipeline. No single publication can account for all parameters and their utility across a diverse range of samples.

GATK frequently misses variants that are confirmed in samples via Sanger, while SAMtools easily detects these. This has been reported in independent clinical laboratories.

ADD REPLYlink written 21 months ago by Kevin Blighe69k
gravatar for Brian Bushnell
3.3 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

You can make variant-calling much faster by using BBMap's tool (which is fully multithreaded) on a raw, unsorted sam file, so you don't have to waste time sorting, indexing and compressing/decompressing bam. It works on bam files too, though, so if you want to use a bam as input I suggest you have use samtools 1.4 or greater in the command line so that the decompression will also be multithreaded. is around 90x faster than the samtools mpileup pipeline, in my testing

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1686 users visited in the last hour