Hi I'm working on an experiment to benchmark variant calling using both a graph and a reference-based approach. I got the graph part working smoothly, but the reference approach is giving me some trouble... In particular, I need to evaluate calls for both haplotypes of the same individuals and I'm testing three strategies:
- merge
hap1
andhap2
in a single FASTA with univocal headers, then align - align independently to each haplotype and merge the BAMs (see below for details), then call
- TBD – same as point
2.
but merge the VCF generated after calling from each BAM independently
I'm not sure which is the more “correct” way to face this problem considering there will always be a compromise to be made. Specifically,
point 1
Everything works just fine down to variant calling but then I'm having problem to set the benchmark with hap.py
since the tool requires a unique reference – in my case I picked hap1
as it is the better one – and in order to fix chromosome headers and duplicate calls in the output VCF after variant calling I've done the following:
zcat <output>.vcf.gz | awk -F '\t' '{ OFS="\t"; gsub("_hap[0-9]", "", $1); print }' | bcftools sort | bcftools norm -D -Oz -o <output>-bench.vcf.gz && tabix -p vcf <output>-bench.vcf.gz
point 2
I order to merge calls from BAMs aligned to different references (haplotype of the same individual in this case) I used bam-mergeRef
since I'm not sure of a way to do so with samtools merge
; the problem with this is that the variant calling step yield an empty VCF...
This is probably the more parsimonious way to do this I guess, so if anyone has input/experience about a possible workaround let me know!
point 3
I haven't thoroughly tested this yet, but my idea is to independently call BAM files aligned to each haplotype, produce the corresponding VCF files and then merge calls; I set up a command to do so but I have never done this before, so also here I'm open to any input/feedback:
bcftools merge --merge all <output>-hap1.vcf.gz <output>-hap2.vcf.gz -Oz -o <output>-merged.vcf.gz
First of all, I'll be mostly interested in knowing which one is the more unbiased way to address this issue; secondly, I'm aware point 1 would require a bit of engineering on the hap.py
side to understand what is causing an error as well as that is probably not advised but if someone has done this before any help is appreciated!
Sorry for the long post, I just want to finish saying I'm working with HiFi long reads and aligning them with minimap2
, then calling variants with deepvariant
.