Expected runtimes for GATK variant calling?
22 months ago
antmantras ▴ 80

Hi all.

I need to run variant calling on 50 plant samples sequenced by WGS. I am currently using GATK, and I would like to know whether the runtimes can be reduced at some steps of the workflow without increasing the number of threads.

My current workflow is:

  • Trim reads and perform QC.
  • Align to reference with bwa mem.
  • Mark duplicates with gatk4 (MarkDuplicatesSpark).
  • Run HaplotypeCaller for each chromosome (10 in total). The command used here is:
gatk --java-options "-Xmx16g" HaplotypeCaller -R "$ref" -I "$outpath/sorted_dedup_reads.bam" -O "$outpath/raw_variants_${chr}.g.vcf.gz" -ERC GVCF -L "$chr"
  • Combine the per-chromosome GVCFs into one file (CombineGVCFs).
  • GenotypeGVCFs.
  • Hard Filtering.
  • Base recalibration.
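
Step 4 can be sketched as a loop that builds one HaplotypeCaller command per chromosome. This is a dry run that only prints the commands; the chromosome names (chr1 to chr10), the paths and the THREADS variable are placeholders, not values from my actual run. The --native-pair-hmm-threads option is a real HaplotypeCaller argument (it defaults to 4 as far as I know). It is shown here on the assumption that matching it to the cores requested from Slurm avoids paying for idle cores, which I have not benchmarked.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one HaplotypeCaller command per chromosome.
# Assumptions (placeholders, not from the real run): chromosome names
# chr1..chr10, the paths below, and THREADS equal to the CPU cores
# requested from Slurm for each chromosome job.

ref="ref.fasta"      # hypothetical reference path
outpath="out"        # hypothetical output directory
THREADS=4            # cores per chromosome job, as in the test run

build_hc_cmd() {
    # Print the HaplotypeCaller command for the chromosome given as $1.
    local chr="$1"
    printf 'gatk --java-options "-Xmx16g" HaplotypeCaller -R %s -I %s/sorted_dedup_reads.bam -O %s/raw_variants_%s.g.vcf.gz -ERC GVCF -L %s --native-pair-hmm-threads %s\n' \
        "$ref" "$outpath" "$outpath" "$chr" "$chr" "$THREADS"
}

# One command per chromosome; each line could be submitted as its own job.
for i in $(seq 1 10); do
    build_hc_cmd "chr${i}"
done
```

Lowering THREADS and the Slurm core request together would be one way to test whether fewer cores per chromosome give a similar runtime for a smaller bill.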

I have already run this workflow on one sample as a test. For the 4th step (HaplotypeCaller), on a Slurm cluster, the program ran with 4 CPU cores per chromosome (i.e. 40 CPU cores in total) and took around 6 hours to finish. My main concern is the cost of running the remaining samples, because we do not have our own cluster and are renting one. Am I running the 4th step correctly? Is there a way to improve the performance of the analysis? Alternatively, we could use other tools such as bcftools or freebayes if their expected runtimes are much better than GATK's. However, as far as I have read, bcftools can only be parallelized by splitting the analysis by chromosome, and I am not sure how to give it this information (i.e. do I have to extract the chromosome coordinates from my BAM files?). Thanks in advance.
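
On the bcftools point, one possible approach (a sketch, not something I have run) is to take the chromosome names from the reference index rather than from the BAM coordinates. Column 1 of the .fai file written by samtools faidx holds the sequence names, and the same names appear in the @SQ lines of the BAM header (samtools view -H). Each name can then be passed to bcftools mpileup with -r, which requires an indexed BAM. The file names and the function names below are hypothetical, and the script only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run sketch: derive chromosome names from a FASTA index (.fai) and
# print one bcftools calling command per chromosome.
# Assumptions: the reference was indexed with `samtools faidx`, so
# "<ref>.fai" exists, and the BAM is indexed. File names are hypothetical.

list_contigs() {
    # Column 1 of a .fai file is the sequence (chromosome) name.
    cut -f1 "$1"
}

build_bcftools_cmd() {
    # Print the mpileup | call pipeline restricted to one chromosome.
    local ref="$1" bam="$2" chr="$3"
    printf 'bcftools mpileup -f %s -r %s %s | bcftools call -mv -Oz -o calls_%s.vcf.gz\n' \
        "$ref" "$chr" "$bam" "$chr"
}

print_per_chromosome_cmds() {
    # Emit one command line per sequence listed in "<ref>.fai".
    local ref="$1" bam="$2" chr
    list_contigs "${ref}.fai" | while read -r chr; do
        build_bcftools_cmd "$ref" "$bam" "$chr"
    done
}
```

Calling print_per_chromosome_cmds ref.fasta sorted_dedup_reads.bam would print one line per chromosome, and each line could be submitted as a separate single-core job.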

Edit: My question is less about the wall-clock time the algorithm takes and more about the amount of computation it performs. Adding cores makes it run faster only up to a certain number of them. The problem is that GATK currently takes 6 hours with 4 CPU cores per chromosome, for a total of 10 chromosomes, and I have to repeat this for the 49 remaining samples. As I mentioned before, the server is not ours, and the company charges us per CPU core-hour used. That is why I am asking whether there are alternatives, or whether I am running the program incorrectly: running HaplotypeCaller on the remaining samples with the same parameters would cost half of the budget allocated to computing.
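
To make the billing concern concrete, the core-hours for the HaplotypeCaller step can be estimated from the figures above. This is an upper bound that assumes every chromosome job holds its 4 cores for the full 6 hours; the function name is only illustrative.

```shell
#!/usr/bin/env bash
# Upper-bound estimate of billed core-hours for the HaplotypeCaller step.
# Assumption: every chromosome job holds its cores for the full wall time.

core_hours() {
    # Arguments: cores per job, jobs per sample, hours per job, samples.
    local cores_per_job="$1" jobs="$2" hours="$3" samples="$4"
    echo $(( cores_per_job * jobs * hours * samples ))
}

core_hours 4 10 6 1     # one sample: prints 240
core_hours 4 10 6 49    # the 49 remaining samples: prints 11760
```

So the remaining samples would be billed at up to roughly 11,760 core-hours for this step alone, unless the per-job core count or runtime can be reduced.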

snp variant calling gatk • 749 views
22 months ago
ahmad mousavi ▴ 800

The computational cost depends directly on factors such as the amount of RAM, the number of CPU cores and the size of the genome. The 6 hours you mention is normal with 4 cores. There is an important difference between bcftools and GATK or freebayes: they use different calling methods. I would try all of them to compare, but I would choose GATK for sure, regardless of the time! I think getting the most accurate result is more important than saving 30-40% in time.


Hi! Thanks for your response. RAM is not a problem here; I can let the program run with a huge amount of RAM (i.e. ~500 GB or more). The genome is not huge in this case: it is a plant genome of around 800 Mb. My question is less about the wall-clock time the algorithm takes and more about the amount of computation it performs (I will edit the question to reflect that). Adding cores makes it run faster only up to a certain number of them. The problem is that GATK currently takes 6 hours with 4 CPU cores per chromosome, for a total of 10 chromosomes, and I have to repeat this for the 49 remaining samples. As I mentioned before, the server is not ours, and the company charges us per CPU core-hour used. That is why I am asking whether there are alternatives, or whether I am running the program incorrectly: running HaplotypeCaller on the remaining samples with the same parameters would cost half of the budget allocated to computing.
