best practice for diploid variant calling
0
0
Entering edit mode
6 hours ago
Matteo Ungaro ▴ 130

Hi I'm working on an experiment to benchmark variant calling using both a graph and a reference-based approach. I got the graph part working smoothly, but the reference approach is giving me some trouble... In particular, I need to evaluate calls for both haplotypes of the same individuals and I'm testing three strategies:

  1. merge hap1 and hap2 in a single FASTA with univocal headers, then align
  2. align independently to each haplotype and merge the BAMs (see below for details), then call
  3. TBD – same as point 2. but merge the VCF generated after calling from each BAM independently

I'm not sure which is the more “correct” way to face this problem considering there will always be a compromise to be made. Specifically,


point 1

Everything works just fine down to variant calling but then I'm having problem to set the benchmark with hap.py since the tool requires a unique reference – in my case I picked hap1 as it is the better one – and in order to fix chromosome headers and duplicate calls in the output VCF after variant calling I've done the following:

zcat <output>.vcf.gz | awk -F '\t' '{ OFS="\t"; gsub("_hap[0-9]", "", $1); print }' | bcftools sort | bcftools norm -D -Oz -o <output>-bench.vcf.gz && tabix -p vcf <output>-bench.vcf.gz

point 2

I order to merge calls from BAMs aligned to different references (haplotype of the same individual in this case) I used bam-mergeRef since I'm not sure of a way to do so with samtools merge; the problem with this is that the variant calling step yield an empty VCF...

This is probably the more parsimonious way to do this I guess, so if anyone has input/experience about a possible workaround let me know!


point 3

I haven't thoroughly tested this yet, but my idea is to independently call BAM files aligned to each haplotype, produce the corresponding VCF files and then merge calls; I set up a command to do so but I have never done this before, so also here I'm open to any input/feedback:

bcftools merge --merge all <output>-hap1.vcf.gz <output>-hap2.vcf.gz -Oz -o <output>-merged.vcf.gz

First of all, I'll be mostly interested in knowing which one is the more unbiased way to address this issue; secondly, I'm aware point 1 would require a bit of engineering on the hap.py side to understand what is causing an error as well as that is probably not advised but if someone has done this before any help is appreciated!

Sorry for the long post, I just want to finish saying I'm working with HiFi long reads and aligning them with minimap2, then calling variants with deepvariant.

variant-calling haplotype samtools diploid bcftools • 62 views
ADD COMMENT

Login before adding your answer.

Traffic: 2397 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6