Question

Difficulty Imputing Missing SNPs for Haplotype Construction in PyPGx

2

7 weeks ago

Andresa Capodifoglio ▴ 40

Hello everyone, how are you?

I’ve recently started using PyPGx, and I still have some beginner-level questions. I’m having trouble imputing genotype data: only some of the missing SNPs needed to build the haplotypes are being imputed, not all of them. I’m using Beagle for the imputation, as suggested, but I still haven’t been able to recover all the necessary SNPs. I suspect this might be related to the reference population I’m working with.

I’d really appreciate some guidance on the best path to follow. I’ve read quite a lot on the topic, but I can’t seem to move forward. Any advice would be very helpful!

pypgx haplotype imputation beagle genotyping • 5.0k views

ADD COMMENT • link updated 7 weeks ago by Aleksandra ▴ 190 • written 7 weeks ago by Andresa Capodifoglio ▴ 40

score 0 · Answer 1 · 2025-09-04

0

7 weeks ago

Aleksandra ▴ 190

I think that nine times out of ten, the SNPs you need are simply absent from your reference panel, and Beagle cannot impute a variant the panel does not contain. Also check that the genome assembly and the chromosome notation (1 vs chr1) match between your data and the reference. The Beagle log often contains the answer. Start by looking up one of the missing SNPs in the reference VCF.
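
The two checks above can be sketched in plain Python, without any VCF library. This is only a rough illustration: the function names `contig_names` and `has_site` are made up for this example, and the position you pass in is assumed to already be in the same genome assembly as the reference VCF (a coordinate match says nothing if the assemblies differ).

```python
import gzip


def _open_text(path):
    """Open a plain-text or gzip/bgzip-compressed VCF as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def _strip_chr(name):
    """Normalise 'chr1' and '1' to the same label."""
    return name[3:] if name.startswith("chr") else name


def contig_names(path):
    """Return the chromosome names exactly as written in the data lines."""
    names = set()
    with _open_text(path) as handle:
        for line in handle:
            if not line.startswith("#"):
                names.add(line.split("\t", 1)[0])
    return names


def has_site(path, chrom, pos):
    """True if any record sits at chrom:pos, ignoring a leading 'chr'."""
    wanted = _strip_chr(chrom)
    with _open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.split("\t", 2)
            if _strip_chr(fields[0]) == wanted and int(fields[1]) == pos:
                return True
    return False
```

Comparing `contig_names(...)` for your target VCF and for the reference VCF shows a `1` vs `chr1` mismatch immediately, and `has_site(reference_vcf, chrom, pos)` tells you whether a given missing SNP is present in the panel at all. The scan is linear, so for a large panel it is worth running it on the single-chromosome file only.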

ADD COMMENT • link 7 weeks ago by Aleksandra ▴ 190

2

Thank you for your attention. What is the best reference VCF panel to use with Beagle in order to ensure the largest possible number of SNPs for imputation? Should I start with the 1000 Genomes Project, or would it be better to use a denser panel like HRC or TOPMed?

ADD REPLY • link 7 weeks ago by Andresa Capodifoglio ▴ 40

0

The denser the panel, the better. TOPMed provides the highest density and the best coverage of rare variants. It is the most powerful option, but also the most resource-intensive. HRC works best for samples of European origin: high accuracy at a lower cost, but you risk missing variants specific to other populations. The 1000 Genomes Project (1kGP) panel is now considered obsolete and too sparse for accurate haplotype determination. Begin with TOPMed.
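
If you have more than one candidate panel on disk, a quick way to compare them for your purpose is to count which of the positions you need each panel lacks. The sketch below is only an illustration: `missing_sites` is a made-up helper, the list of required `(chromosome, position)` pairs is assumed to be compiled by you for the genes of interest, and all coordinates are assumed to be in the same assembly as the panel.

```python
import gzip


def _open_text(path):
    """Open a plain-text or gzip/bgzip-compressed VCF as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def _strip_chr(name):
    """Normalise 'chr1' and '1' to the same label."""
    return name[3:] if name.startswith("chr") else name


def missing_sites(required, panel_path):
    """Return the required (chrom, pos) pairs that the panel VCF lacks.

    Chromosome names are compared without a leading 'chr', and the
    result is a sorted list of normalised (chrom, pos) tuples.
    """
    remaining = {(_strip_chr(chrom), int(pos)) for chrom, pos in required}
    with _open_text(panel_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos = line.split("\t", 2)[:2]
            remaining.discard((_strip_chr(chrom), int(pos)))
            if not remaining:
                break  # every required site has been found
    return sorted(remaining)
```

Running `missing_sites(required, path)` once per panel and comparing the lengths of the returned lists gives a direct answer to which panel covers more of the SNPs you actually need, which matters more here than overall panel density.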

ADD REPLY • link 7 weeks ago by Aleksandra ▴ 190