Hi there,
I have diploid calls from HiFi long reads for one sample, made against both of its haplotype assemblies (I'm working with humans, so I only have hap1 and hap2). Is there a way to correctly merge these two files onto a single reference?
The idea is to benchmark mapping to the sample's own genome against mapping to any linear reference; in theory, the genome of origin should perform better. I'm open to other suggestions if this isn't possible; for instance, mapping to the "most complete" of the two haplotypes and then combining the two VCF files to prevent issues with mismatches in reference calls.
This would still be better than mapping to a different reference, but it won't capture variants specific to the other haplotype. Let me know what you think, thanks in advance!
This question is a little similar to your previous question (best practice for diploid variant calling). Just as added info, there is a tool called "phased assembly variant caller" (PAV): https://github.com/EichlerLab/pav
It is one tool for "assembly-based variant calling"; there may be other tools that are more focused on smaller or larger variants.
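To make the "merge onto a single reference" part concrete, here is a toy Python sketch of the general idea behind assembly-based calling: call variants for each haplotype against the same reference, then combine the two haploid call sets into phased diploid genotypes. This is not how PAV is implemented. The function name `merge_haploid_calls` and the tuple layout are made up for illustration, and the sketch assumes both call sets are already normalized and on identical reference coordinates, and that a site missing from one haplotype is truly reference there (not simply uncovered by that assembly).

```python
from collections import defaultdict


def merge_haploid_calls(hap1_calls, hap2_calls):
    """Combine two haploid call sets into phased diploid genotypes.

    Each input is an iterable of (chrom, pos, ref, alt) tuples called
    against the SAME reference (assumption of this sketch).
    Returns a dict mapping each variant to a "hap1|hap2" genotype string.
    """
    # For every variant, track presence on hap1 (index 0) and hap2 (index 1).
    present = defaultdict(lambda: [0, 0])
    for idx, calls in enumerate((hap1_calls, hap2_calls)):
        for variant in calls:
            present[variant][idx] = 1

    # Sort by (chrom, pos, ref, alt) and format as a phased genotype.
    return {v: f"{h[0]}|{h[1]}" for v, h in sorted(present.items())}


if __name__ == "__main__":
    hap1 = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
    hap2 = [("chr1", 100, "A", "G"), ("chr1", 300, "G", "GA")]
    for variant, genotype in merge_haploid_calls(hap1, hap2).items():
        print(variant, genotype)
```

A variant seen on both haplotypes comes out homozygous (`1|1`), and one seen on a single haplotype comes out heterozygous. In real data the hard part is everything the sketch assumes away: matching variant representations and knowing which regions each haplotype actually covers.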
The idea of aligning the reads used to create an assembly against that same assembly is a little bit tricky. It can be useful, but with perfect mapping and a perfect assembly, there would be "no variants". So any work in that direction basically uncovers misassemblies, misalignments, or bias from aligning reads from the wrong "haplotype". The reads could be separated by haplotype using a tool like whatshap, or maybe some other technique.
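As a rough illustration of separating reads by haplotype, here is a minimal Python sketch that groups read names by the `HP` tag in SAM-formatted text. It assumes the alignments were already haplotagged (for example with `whatshap haplotag`, which writes `HP:i:1` or `HP:i:2` on reads it can assign) and exported as SAM text. The function name `split_by_haplotype` is hypothetical, and a real workflow would more likely use a BAM library or whatshap's own tooling.

```python
def split_by_haplotype(sam_lines):
    """Group read names by their HP (haplotype) tag.

    sam_lines: iterable of SAM text lines (headers allowed).
    Reads without an HP tag, or with an unexpected value, go to "untagged".
    """
    groups = {"hap1": [], "hap2": [], "untagged": []}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        # Optional tags start after the 11 mandatory SAM columns.
        hp = next(
            (f.split(":", 2)[2] for f in fields[11:] if f.startswith("HP:i:")),
            None,
        )
        key = {"1": "hap1", "2": "hap2"}.get(hp, "untagged")
        groups[key].append(fields[0])
    return groups


if __name__ == "__main__":
    demo = [
        "@HD\tVN:1.6\tSO:coordinate",
        "readA\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:1\tPS:i:100",
        "readB\t0\tchr1\t120\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:i:2\tPS:i:100",
        "readC\t0\tchr1\t140\t60\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(split_by_haplotype(demo))
```

Reads that cannot be assigned to either haplotype end up in `untagged`; whether to drop them or align them to both assemblies is a separate decision that affects exactly the kind of bias described above.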