Question

best practice for diploid variant calling

0

Entering edit mode

12 weeks ago

Matteo Ungaro ▴ 130

Hi I'm working on an experiment to benchmark variant calling using both a graph and a reference-based approach. I got the graph part working smoothly, but the reference approach is giving me some trouble... In particular, I need to evaluate calls for both haplotypes of the same individuals and I'm testing three strategies:

merge hap1 and hap2 in a single FASTA with univocal headers, then align
align independently to each haplotype and merge the BAMs (see below for details), then call
TBD – same as point 2. but merge the VCF generated after calling from each BAM independently

I'm not sure which is the more “correct” way to face this problem considering there will always be a compromise to be made. Specifically,

point 1

Everything works just fine down to variant calling but then I'm having problem to set the benchmark with hap.py since the tool requires a unique reference – in my case I picked hap1 as it is the better one – and in order to fix chromosome headers and duplicate calls in the output VCF after variant calling I've done the following:

zcat <output>.vcf.gz | awk -F '\t' '{ OFS="\t"; gsub("_hap[0-9]", "", $1); print }' | bcftools sort | bcftools norm -D -Oz -o <output>-bench.vcf.gz && tabix -p vcf <output>-bench.vcf.gz

point 2

I order to merge calls from BAMs aligned to different references (haplotype of the same individual in this case) I used bam-mergeRef since I'm not sure of a way to do so with samtools merge; the problem with this is that the variant calling step yield an empty VCF...

This is probably the more parsimonious way to do this I guess, so if anyone has input/experience about a possible workaround let me know!

point 3

I haven't thoroughly tested this yet, but my idea is to independently call BAM files aligned to each haplotype, produce the corresponding VCF files and then merge calls; I set up a command to do so but I have never done this before, so also here I'm open to any input/feedback:

bcftools merge --merge all <output>-hap1.vcf.gz <output>-hap2.vcf.gz -Oz -o <output>-merged.vcf.gz

First of all, I'll be mostly interested in knowing which one is the more unbiased way to address this issue; secondly, I'm aware point 1 would require a bit of engineering on the hap.py side to understand what is causing an error as well as that is probably not advised but if someone has done this before any help is appreciated!

Sorry for the long post, I just want to finish saying I'm working with HiFi long reads and aligning them with minimap2, then calling variants with deepvariant.

variant-calling haplotype samtools diploid bcftools • 4.0k views

ADD COMMENT • link 1 hour ago by Matteo Ungaro ▴ 130

0

Entering edit mode

It sounds like you are calling variants with respect to a "phased assembly" (e.g. a diploid assembly has both a maternal and paternal haploid genome assembly). are you aligning the sample against itself e.g. the reads used to create the assembly are aligned against the assembly itself?

ADD REPLY • link 12 weeks ago by cmdcolin ★ 4.4k

0

Entering edit mode

@cmdcolin indeed, you're correct! We do have a parental cell line which was used to build the two phased assemblies with Verkko, then this same cell line has been exposed to different conditions and de novo sequenced.

By aligning the reads used to assemble the two haplotypes as well as those exposed to different experimental conditions, we are looking to better understand mutational patterns taking advantage of the ground truth in our two assemblies.

ADD REPLY • link 12 weeks ago by Matteo Ungaro ▴ 130

1

Entering edit mode

that is quite a cool experiment. I am not actually sure of the best approach (sorry to disappoint). I think that if you look up "phased assembly variant calling" there are some discussions like in Variant calling analysis on phased assembly

I think that aligning to a graph that is composed of both phased assemblies is, at least theoretically, pretty ideal. the other solutions like you describe above seem to be a bit more complex...you have to consider whether the aligner would understand having extremely similar sequences if you align reads to the concatenation of both the maternal and paternal haplotypes....or you would have to consider the penalties involved for the aligner to align to both the maternal and paternal haplotype separately, which would implicitly align e.g. maternal reads to the paternal haplotype, and vice versa. So, tricky problem!

ADD REPLY • link 12 weeks ago by cmdcolin ★ 4.4k

score 1 · Answer 1 · 2025-11-14

Cool experiment! Aligning HiFi reads back to phased Verkko assemblies for mutation benchmarking sounds solid.

For the reference-based approach, I think point 3 (calling variants separately on each haplotype then merging VCFs) is the least biased. It avoids multi-mapping issues from concatenating FASTAs (point 1) and doesn't force BAM merging across differing refs (point 2). Use bcftools merge --merge all, but first normalize contig names (e.g., remove _hap suffixes) to align chromosomes. Watch for overlapping positions—run bcftools norm -m+any afterward to handle multi-allelic sites.

If haplotypes have structural differences, consider lift-over tools like CrossMap for better alignment before merging. This keeps things phased-aware without projecting everything to one hap.

For hap.py benchmarking, build a diploid truth set by combining phased variants from your assemblies, or use one hap as ref but flag potential biases in results.

Graph methods like PGGB (from the pangenome toolkit) are theoretically best for diploid handling—glad that's working smoothly. Check PacBio's HiFi variant calling guide for DeepVariant tweaks: align with pbmm2, phase with WhatsHap post-calling.

No perfect compromise, but point 3 minimizes reference bias. If empty VCFs persist in point 2, debug the bam-mergeRef step—might need a combined ref FASTA for DeepVariant.

Kevin