I wanted to test if the resulting VCF would be the same if I split a BAM file before variant calling against calling variants on the whole BAM as input.
I made 2 different runs: The 1st run was using the splitted BAMs as inputs to HaplotypeCaller which will output 4 VCF files for each splitted BAM Then calling GenotypeGVCFs on all of them using -V option (i.e. -V file1.vcf -V file2.vcf -V file3.vcf -V file4.vcf) the output of that step is one VCF. The 2nd run was using the whole original BAM as input to HaplotypeCaller which will output one VCF <whole_vcf> then calling GenotypeGVCFs.
HaplotypeCaller, for each splitted file (chunk1.bam, chunk2.bam, chunk3.bam, chunk4.bam):
java -jar gatk3.jar -T HaplotypeCaller -R human_g1k_v37.fasta -D dbsnp_137.b37.excluding_sites_after_129.vcf
-o chunk1.vcf -pairHMM VECTOR_LOGLESS_CACHING --emitRefConfidence GVCF --variant_index_type LINEAR
--variant_index_parameter 128000 -A DepthPerAlleleBySample -stand_call_conf 30 -stand_emit_conf 10 -I chunk1.bam
java -jar gatk3.jar -T GenotypeGVCFs -R human_g1k_v37.fasta -D dbsnp_137.b37.excluding_sites_after_129.vcf
-o out_from_splits.vcf -A Coverage -A GCContent -A HaplotypeScore -A MappingQualityRankSumTest -A InbreedingCoeff
-A FisherStrand -A QualByDepth -A ChromosomeCounts -V chunk1.vcf -V chunk2.vcf -V chunk3.vcf -V chunk4.vcf
Run2 commands are the same but they are called on a the whole original BAM instead of multiple chunks.
The resulting VCF from GenotypeGVCFs in the 2nd run <whole_vcf> was 33M whereas resulting VCF from the chunks of the 1st run was 7M.
I tried to check what went wrong so I merged the 4 VCFs from HaplotypeCaller using vcf-merge from vcftools then vcf-compare to compare between the merged VCF and the whole VCF (whole_vcf), the number of variants was identical and both files matched expected 3 mismatches detected.
Can anybody tell me what went wrong? Why are the outputs of GenotypeGVCF from the two experiments different? Should I merge the VCFs from the 1st run before Genotyping to get same results?
Thanks in advance,
I think you are not giving enough information. For example when you say you split the BAM file, do you mean you split a multi sample BAM file into BAM files with only data from one sample?, or do you split a single sample BAM file into BAM files with limited regions?
Also, can you share the exact commands you used in each step? The reason for the disagreement could be as simple as a missing argument in one command.
Thanks for your reply, I updated the post with commands.
I split a single sample BAM into 4 BAM files with regions, I made sure manually the 4 chunks contained the reads from the original file.