Question

best practice for diploid variant calling

0

Entering edit mode

8 weeks ago

Matteo Ungaro ▴ 130

Hi I'm working on an experiment to benchmark variant calling using both a graph and a reference-based approach. I got the graph part working smoothly, but the reference approach is giving me some trouble... In particular, I need to evaluate calls for both haplotypes of the same individuals and I'm testing three strategies:

merge hap1 and hap2 in a single FASTA with univocal headers, then align
align independently to each haplotype and merge the BAMs (see below for details), then call
TBD – same as point 2. but merge the VCF generated after calling from each BAM independently

I'm not sure which is the more “correct” way to face this problem considering there will always be a compromise to be made. Specifically,

point 1

Everything works just fine down to variant calling but then I'm having problem to set the benchmark with hap.py since the tool requires a unique reference – in my case I picked hap1 as it is the better one – and in order to fix chromosome headers and duplicate calls in the output VCF after variant calling I've done the following:

zcat <output>.vcf.gz | awk -F '\t' '{ OFS="\t"; gsub("_hap[0-9]", "", $1); print }' | bcftools sort | bcftools norm -D -Oz -o <output>-bench.vcf.gz && tabix -p vcf <output>-bench.vcf.gz

point 2

I order to merge calls from BAMs aligned to different references (haplotype of the same individual in this case) I used bam-mergeRef since I'm not sure of a way to do so with samtools merge; the problem with this is that the variant calling step yield an empty VCF...

This is probably the more parsimonious way to do this I guess, so if anyone has input/experience about a possible workaround let me know!

point 3

I haven't thoroughly tested this yet, but my idea is to independently call BAM files aligned to each haplotype, produce the corresponding VCF files and then merge calls; I set up a command to do so but I have never done this before, so also here I'm open to any input/feedback:

bcftools merge --merge all <output>-hap1.vcf.gz <output>-hap2.vcf.gz -Oz -o <output>-merged.vcf.gz

First of all, I'll be mostly interested in knowing which one is the more unbiased way to address this issue; secondly, I'm aware point 1 would require a bit of engineering on the hap.py side to understand what is causing an error as well as that is probably not advised but if someone has done this before any help is appreciated!

Sorry for the long post, I just want to finish saying I'm working with HiFi long reads and aligning them with minimap2, then calling variants with deepvariant.

variant-calling haplotype samtools diploid bcftools • 3.7k views

ADD COMMENT • link updated 8 weeks ago by cmdcolin ★ 4.4k • written 8 weeks ago by Matteo Ungaro ▴ 130

0

Entering edit mode

It sounds like you are calling variants with respect to a "phased assembly" (e.g. a diploid assembly has both a maternal and paternal haploid genome assembly). are you aligning the sample against itself e.g. the reads used to create the assembly are aligned against the assembly itself?

ADD REPLY • link 8 weeks ago by cmdcolin ★ 4.4k

0

Entering edit mode

@cmdcolin indeed, you're correct! We do have a parental cell line which was used to build the two phased assemblies with Verkko, then this same cell line has been exposed to different conditions and de novo sequenced.

By aligning the reads used to assemble the two haplotypes as well as those exposed to different experimental conditions, we are looking to better understand mutational patterns taking advantage of the ground truth in our two assemblies.

ADD REPLY • link 8 weeks ago by Matteo Ungaro ▴ 130

1

Entering edit mode

that is quite a cool experiment. I am not actually sure of the best approach (sorry to disappoint). I think that if you look up "phased assembly variant calling" there are some discussions like in Variant calling analysis on phased assembly

I think that aligning to a graph that is composed of both phased assemblies is, at least theoretically, pretty ideal. the other solutions like you describe above seem to be a bit more complex...you have to consider whether the aligner would understand having extremely similar sequences if you align reads to the concatenation of both the maternal and paternal haplotypes....or you would have to consider the penalties involved for the aligner to align to both the maternal and paternal haplotype separately, which would implicitly align e.g. maternal reads to the paternal haplotype, and vice versa. So, tricky problem!

ADD REPLY • link 8 weeks ago by cmdcolin ★ 4.4k