Hello,
I have a GATK pipeline following best practices workflow to call rare germline SNPs and INDELs. I wanted to reduce time and so I split the processing at chromosome level
In one analysis, I split the aligned and duplicates marked bam file by chromosomes and ran base recalibration(BQSR) and haplotype caller (regular mode, not gvcf) and then concatenated the vcf files of each chromosome of a sample into 1.
In another one, I ran the exact same commands without splitting at chromosome level. I filtered the non standard chromosomes from the result and when I tried to compare the results to the earlier vcf file , there were a couple 100 different variant calls, and in the calls that were matching handful of them had different genotype.
At the moment I am comparing the vcf calls using vcfR package and merging based on CHROM, POS, REF and ALT columns.
Is this expected behavior? I thought that base recalibration and haplotype caller don't consider regions across contigs for it's functionality.
Should I filter the non-standard chromosomes even before running base recalibration by providing the standard chromosomes as a list in -L in the analysis where I am not splitting the chromosomes?
Thank you,
Your guidance is greatly appreciated.
Show us some examples.
This is the script (I am running it on JupyterLab):
Output:
Common Variants:
head(merged_df)
Unique Variants:
thanks but that is not what I meant. Please show us those variants in the VCF files, i want to see the quality, the FILTER, the DEPTH, etc...
for example chr1-14932 on grch38 has usually a very bad quality https://gnomad.broadinstitute.org/variant/1-14932-G-T?dataset=gnomad_r4
furthermore, you'd better use
bcftools isec
to compare VCF files.I did not compare them after filtration. this was as soon as I called the haplotype caller to verify my pipeline's integrity after splitting by chromosomes.
So these are the calls that are common but had different genotypes:
This is the INFO field comparison for these: