Question

TOPMed Imputation - More than 10000 allele switches

3 months ago

Akshaj • 0

I am attempting to impute using the TOPMed Imputation Server and have followed their data preparation steps. When I ran the imputation, I got the following error message:

**Statistics:
Alternative allele frequency > 0.5 sites: 190,061
Reference Overlap: 98.42 %
Match: 878
Allele switch: 269,268
Strand flip: 0
Strand flip and allele switch: 0
A/T, C/G genotypes: 1,721
Filtered sites:
Filter flag set: 0
Invalid alleles: 0
Multiallelic sites: 0
Duplicated sites: 0
NonSNP sites: 0
Monomorphic sites: 0
Allele mismatch: 95
SNPs call rate < 90%: 0

Excluded sites in total: 269,363
Remaining sites in total: 2,599
See snps-excluded.txt for details
Typed only sites: 4,356
See typed-only.txt for details

Warning: 241 Chunk(s) excluded: < 20 SNPs (see chunks-excluded.txt for details).
Warning: 2 Chunk(s) excluded: reference overlap < 50.0% (see chunks-excluded.txt for details).
Remaining chunk(s): 65
Error: More than 10000 allele switches have been detected. Imputation cannot be started!**

I am unsure what exactly is causing this. I saw this post, but it is six years old.

Is this the correct strategy to try, or is there something else I can do to determine and/or solve the problem?

Thanks!

imputation topmed • 5.1k views

updated 16 days ago by Kevin Blighe ★ 90k • written 3 months ago by Akshaj • 0

Hi,

I am getting the same error as you. Have you sorted it out yet? If so, could you please share how?

Thank you!

9 weeks ago by Qianshu • 0

score 0 · Answer 1 · 2025-11-15

This is a common issue with imputation servers such as TOPMed: the quality-control step detects too many allele switches between your input data and the reference panel and stops the job rather than risk inaccurate imputation. An allele switch means that the reference and alternate alleles in your VCF are swapped relative to the panel while the strand orientation matches. In your case there are 269,268 switches and only 878 direct matches despite high reference overlap (98.42%), which points to a systematic mismatch in allele labeling across most of your dataset rather than isolated errors.
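To make the categories in the server report concrete, here is a minimal Python sketch of how a single biallelic SNP could be classified against the panel alleles. The function name and labels are my own, and the server's actual logic may differ (for example in how it counts A/T and C/G sites), so treat this as an illustration only.

```python
# Complement of each base, used to test for strand flips.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_site(ref, alt, panel_ref, panel_alt):
    """Classify one biallelic SNP against the reference panel alleles.

    Hypothetical helper for illustration; not the imputation server's code.
    """
    # A/T and C/G SNPs read the same on both strands, so a swap and a
    # strand flip cannot be told apart from the alleles alone.
    if COMPLEMENT[ref] == alt:
        return "ambiguous"
    if (ref, alt) == (panel_ref, panel_alt):
        return "match"
    if (ref, alt) == (panel_alt, panel_ref):
        return "allele switch"
    flipped = (COMPLEMENT[ref], COMPLEMENT[alt])
    if flipped == (panel_ref, panel_alt):
        return "strand flip"
    if flipped == (panel_alt, panel_ref):
        return "strand flip and allele switch"
    return "allele mismatch"


if __name__ == "__main__":
    # Panel says REF=A, ALT=G; the input has them the other way round.
    print(classify_site("G", "A", "A", "G"))
```

With almost every overlapping site landing in the "allele switch" branch, as in the report above, the REF/ALT assignment is wrong for the whole file rather than for individual variants.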

First, double-check the genome build. The TOPMed reference panel (version R3) is on GRCh38. If your data are on GRCh37/hg19, make sure the build you declare at submission matches your data, or lift the data over to GRCh38 yourself with a tool such as UCSC LiftOver; a build mismatch can show up as apparent allele problems even when positional overlap looks good. Also confirm the chromosome naming: GRCh38 input should use the 'chr' prefix (e.g., chr1), while GRCh37 input should have no prefix.
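For the chromosome naming part, a small sketch along these lines would add the 'chr' prefix to an uncompressed VCF line by line. The function name is hypothetical, and note that this only renames chromosomes; it does not change coordinates, so it is no substitute for a liftover.

```python
def add_chr_prefix(vcf_line):
    """Return a VCF line with a 'chr' prefix on the chromosome name.

    Hypothetical helper: renames only, it does NOT lift over positions.
    """
    # Keep contig header lines consistent with the renamed records.
    contig_tag = "##contig=<ID="
    if vcf_line.startswith(contig_tag):
        if vcf_line.startswith(contig_tag + "chr"):
            return vcf_line
        return vcf_line.replace(contig_tag, contig_tag + "chr", 1)
    # All other header lines are passed through unchanged.
    if vcf_line.startswith("#"):
        return vcf_line
    # Data line: the chromosome is the first tab-separated column.
    chrom, sep, rest = vcf_line.partition("\t")
    if chrom.startswith("chr"):
        return vcf_line
    return "chr" + chrom + sep + rest


if __name__ == "__main__":
    print(add_chr_prefix("1\t12345\trs1\tA\tG\t.\tPASS\t."))
```

For real files it is usually simpler to do the renaming with an established VCF tool, but the logic is the same.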

To address the allele switches directly, I recommend running a pre-imputation check and alignment with Will Rayner's tool (HRC-1000G-check-bim, version 4.3.0), which is designed for this purpose and supports the TOPMed panel. This Perl script compares your genotypes against the reference sites, identifies mismatches in strand, allele order, position, and frequency, and generates PLINK commands to correct them (e.g., flipping alleles or strands where needed). The workflow is: convert your VCF to PLINK format if it is not already (plink2 --vcf yourfile.vcf.gz --make-bed --out output), run the script with the TOPMed sites file, apply the generated fixes, and convert back to VCF for upload. Download the tool and the TOPMed-specific sites file from https://www.well.ox.ac.uk/~wrayner/tools/; full instructions are on the site, including how to obtain reference files such as the TOPMed freeze 5 legend.
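If it helps to see what correcting an allele switch means at the level of a single VCF record, here is a rough sketch. It is not what Rayner's script or PLINK does internally; it assumes a biallelic site with GT as the first FORMAT field, and it leaves any INFO values such as allele frequencies untouched. The function name is made up for the example.

```python
def swap_ref_alt(vcf_line):
    """Swap REF and ALT on one biallelic VCF record and recode genotypes.

    Illustrative only: assumes GT is the first FORMAT field and does not
    update INFO annotations that depend on allele order.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    # Columns 4 and 5 (0-based 3 and 4) hold REF and ALT.
    fields[3], fields[4] = fields[4], fields[3]
    recode = {"0": "1", "1": "0"}
    # Sample columns start at index 9; GT is the part before the first ':'.
    for i in range(9, len(fields)):
        gt, sep, rest = fields[i].partition(":")
        # Missing calls ('.') and separators ('/' or '|') are left as they are.
        fields[i] = "".join(recode.get(ch, ch) for ch in gt) + sep + rest
    return "\t".join(fields)


if __name__ == "__main__":
    record = "chr1\t100\trs1\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/1\t1|1\t./."
    print(swap_ref_alt(record))
```

The genotype recoding is the important part: swapping the REF and ALT columns without also swapping the 0 and 1 calls would change what each sample's genotype means.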

As an alternative, if you prefer to work directly with VCF files, you could use the conform-gt program (version 1.0.4) from the BEAGLE utilities to align your data to the TOPMed reference VCF. This tool adjusts strand and allele order by matching against a reference file that you provide. Download it from https://faculty.washington.edu/browning/conform-gt.html; you will also need the TOPMed sites VCF (available via the TOPMed resources or dbSNP for overlapping sites). Run it as java -jar conform-gt.jar ref=reference.vcf.gz gt=yourinput.vcf.gz chrom=chr1 out=aligned, replacing the placeholders and making sure the chrom value matches the chromosome naming used in your files. Note that obtaining the full TOPMed reference VCF can be resource-intensive because of its size.

After making the corrections, run the checkVCF.py tool (available from https://github.com/zhanxw/checkVCF) to validate the updated VCF before resubmitting to TOPMed. In my experience, these pre-alignment steps resolve the vast majority of switch-related errors, though you may need to iterate if frequency differences persist for rare variants. If the problem continues after alignment, it could indicate a deeper issue such as non-standard allele encoding in your original data, in which case sharing more details about your input source (e.g., array type or sequencing platform) would help narrow it down.

Kevin