It is a known fact that GATK does not scale or work very well when a draft assembly has a lot of contigs. Mine has 366989 (water buffalo reference genome). I have successfully generated GVCFs of all my 81 samples for the whole genome through Haplotypecaller without any errors. It usually fails or freezes in the CombineGVCF stage. I have thought of a strategy and would like to know if there may be any pitfalls or any important point I need to consider.
Since, I am only interested in my 'genes of interest', and genes are generally not present in all my contigs, I select only those lines from my GVCF files that belong to those contigs and leave the rest. It reduced my total no. of contigs to 4338. (Used this command to find that- cat GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.revised.gtf|cut -f 1|uniq|sort >contigs_with_genes.txt)
I then split the 4338 contigs to sets of 23 contigs (because GATK has been made keeping humans in mind), split each GVCF files into 23 smaller GVCF files and then run CombineGVCF on my 81 samples.
I will end up with 23 combined GVCF files and I can run GenotypeGVCF on them to get final VCF file which I can merge vertically to get my final VCF file.
PS. I cannot run GenomicsDBimport (which is GATK's current recommendation in place of CombineGVCF) because of the caveat that it can run on one contig at a time and even though I have access to a grid engine, my calculation suggested it will take years to get over because of my number of contigs.
I also research the buffalo (River buffalo) whole genome sequencing data. I have 50 sample whole genome sequencing and now I'm Recalibration stage from SNP Calling. But, as you know Buffalo do not have a known sites VCF file, Right? So, i have long time remained at this stage and i have not found a satisfactory solution. But today i was happy to see your post. Please help me to solve this problem, if possible.
You said that, generat GVCFs from all samples for the whole genome through Haplotypecaller without any errors, Right? Ok, But did you do the Recalibration before doing Haplotypecaller?
You are right. Water buffalo does not have a known sites VCF file. I have avoided this step.
This document may help:https://software.broadinstitute.org/gatk/best-practices/workflow (especially What is not GATK Best Practices?)
The main point here is you may report that you have 'adapted' the GATK germline variant calling pipeline and have removed certain steps due to lack of truth set.Another way is the bootstrapping procedure if you have a lot of time in hand. It is mentioned in this link: https://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr
The main point here is:
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
many thanks for your reply,
But I do not have enough time because I am the last semester student and it's a very important time for me. On the other hand, there have been people who have already gone through this path and have concluded that this method does not work (http://evodify.com/gatk-the-best-practice-for-genotype-calling-in-a-non-model-organism/). So, I think we should remove the Recalibration step.
But my request is for you to support me in the next steps, if possible. I also want to use the Haplotypecaller option like you. So, I want to know what your command line was for this stage?
I used the default:
java -Xmx8g -jar HaplotypeCaller -R <your_reference_genome> --native-pair-hmm-threads 4 -I <your_bam_file> -ERC GVCF -O output.g.vcf
What is the option --native-pair-hmm-threads 4 ??
According to this command line, you've done the HaplotypeCaller for a sample, right? And then you combined the generated GVCFs files, right?
Yes..I did that only..
--native-pair-hmm-threads allows you to define no. of threads that will be utilised to compute the pairHMM step of haplotypecaller...
For more information, check this: https://software.broadinstitute.org/gatk/documentation/tooldocs/18.104.22.168/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php