Question: Question about Variant Calling: Whole file VS multiple smaller files
0
gravatar for Hatem Elshazly
4.2 years ago by
Egypt
Hatem Elshazly60 wrote:

Hello Everyone,

I wanted to test if the resulting VCF would be the same if I split a BAM file before variant calling against calling variants on the whole BAM as input.
I made 2 different runs: The 1st run was using the splitted BAMs as inputs to HaplotypeCaller which will output 4 VCF files for each splitted BAM Then calling GenotypeGVCFs on all of them using -V option (i.e. -V file1.vcf -V file2.vcf -V file3.vcf -V file4.vcf) the output of that step is one VCF. The 2nd run was using the whole original BAM as input to HaplotypeCaller which will output one VCF <whole_vcf> then calling GenotypeGVCFs.

Run1 Commands:
HaplotypeCaller, for each splitted file (chunk1.bam, chunk2.bam, chunk3.bam, chunk4.bam):
java -jar gatk3.jar -T HaplotypeCaller -R human_g1k_v37.fasta -D dbsnp_137.b37.excluding_sites_after_129.vcf
-o chunk1.vcf -pairHMM VECTOR_LOGLESS_CACHING --emitRefConfidence GVCF --variant_index_type LINEAR
--variant_index_parameter 128000 -A DepthPerAlleleBySample -stand_call_conf 30 -stand_emit_conf 10 -I chunk1.bam

GenotypeGVCFs: java -jar gatk3.jar -T GenotypeGVCFs -R human_g1k_v37.fasta -D dbsnp_137.b37.excluding_sites_after_129.vcf
-o out_from_splits.vcf -A Coverage -A GCContent -A HaplotypeScore -A MappingQualityRankSumTest -A InbreedingCoeff
-A FisherStrand -A QualByDepth -A ChromosomeCounts -V chunk1.vcf -V chunk2.vcf -V chunk3.vcf -V chunk4.vcf


Run2 commands are the same but they are called on a the whole original BAM instead of multiple chunks.

The resulting VCF from GenotypeGVCFs in the 2nd run <whole_vcf> was 33M whereas resulting VCF from the chunks of the 1st run was 7M.

I tried to check what went wrong so I merged the 4 VCFs from HaplotypeCaller using vcf-merge from vcftools then vcf-compare to compare between the merged VCF and the whole VCF (whole_vcf), the number of variants was identical and both files matched expected 3 mismatches detected.

Can anybody tell me what went wrong? Why are the outputs of GenotypeGVCF from the two experiments different? Should I merge the VCFs from the 1st run before Genotyping to get same results?

Thanks in advance,
Shazly

ADD COMMENTlink modified 4.2 years ago by Carlos Borroto1.9k • written 4.2 years ago by Hatem Elshazly60

I think you are not giving enough information. For example when you say you split the BAM file, do you mean you split a multi sample BAM file into BAM files with only data from one sample?, or do you split a single sample BAM file into BAM files with limited regions?

Also, can you share the exact commands you used in each step? The reason for the disagreement could be as simple as a missing argument in one command.

ADD REPLYlink written 4.2 years ago by Carlos Borroto1.9k

Thanks for your reply, I updated the post with commands.

I split a single sample BAM into 4 BAM files with regions, I made sure manually the 4 chunks contained the reads from the original file.

ADD REPLYlink written 4.2 years ago by Hatem Elshazly60
2
gravatar for Carlos Borroto
4.2 years ago by
Carlos Borroto1.9k
Washington Metropolitan Area
Carlos Borroto1.9k wrote:

Okay, thanks for the extra info. A couple of notes and a possible improvement to what you are doing.

First, GenotypeGVCFs is not meant to genotype multiple partial VCF files of the same sample. It is meant for multiple full VCF files of different samples. I'm not sure this is why you are getting the wrong output, but I wouldn't be surprised if that's the case. Second, GenotypeGVCFs value is when run on data from multiple samples. In your case I don't expect you will gain anything over if you just run HaplotypeCaller.

Assuming you are getting your pipeline ready for multiple samples to be processed together. This is what I did recently for scather-gather in the HaplotypeCaller step.

  • I didn't split my BAM files, I just ran HaplotypeCaller with --intervals. I split my BED intervals file in multiple pieces and past one piece at a time. With BAM indexing there should be no penalty for running gatk on the full BAM but only process part of it.
  • I then used bcftools concat to concatenate all the partial VCFs for a single sample
  • I then ran GenotypeGVCFs in all my concatenated individual sample VCFs.

Hope it helps, Carlos

ADD COMMENTlink written 4.2 years ago by Carlos Borroto1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 875 users visited in the last hour