Hello am29,
I've seen this kind of issue crop up a few times when merging datasets for population structure analyses, especially with reference panels like 1000 Genomes. The NaN values in Z0/Z1/Z2 and the -1 in PI_HAT for pairs involving your dataset samples (while 1000G pairs look fine) typically indicate that PLINK couldn't compute the IBD/IBS estimates for those specific pairs. This isn't a bug per se, but usually points to data quality or compatibility problems post-merge. I'll break down the likely causes and solutions below, then suggest a better workflow for your ancestry MDS goal.
Likely Reasons for the NaNs/-1
Based on PLINK's --genome mechanics (which relies on method-of-moments estimation of IBD probabilities from IBS sharing across autosomal SNPs):
- Insufficient overlapping/informative SNPs for those pairs: If your dataset and 1000G don't share many variants with called genotypes in both samples of the pair, the computation fails. PLINK needs a reasonable number of non-missing, heterozygous/heterozygous or homozygous/homozygous comparisons to estimate Z0/Z1/Z2 (probabilities of sharing 0/1/2 alleles IBD). With too few (e.g., <100-500 effective SNPs), it defaults to NaN/-1. Your 1000G pairs work because they share the full variant set; your samples likely have sparse overlap.
- High missingness in your samples: If your dataset has low genotyping coverage (e.g., targeted/exome data) or poor call rates at 1000G sites, entire pairs could have zero usable SNPs.
- Allele frequency estimation issues: --genome uses empirical MAFs from the merged dataset to adjust for population structure. If your samples are from a different ancestry (or your dataset is small), this biases the freqs, leading to unstable estimates. (PI_HAT defaults to -1 when the estimator can't converge.)
- Merge artifacts: Strand flips, ref/alt mismatches, or genome build differences (e.g., hg19 vs. hg38) can make variants "non-informative" after merge, effectively creating missing data.
- Less common: Pedigree info in .fam causing founder-only filtering, but that usually affects all pairs uniformly.
The "OT" in RT is odd (should be a number like 0 for unrelated)—might be a placeholder or parse error, but not the core issue.
Quick Diagnostics and Fixes
Run these on your merged dataset (assuming it's in binary format: plink --bfile your_merged --...):
Check overall variant overlap and MAFs:
plink --bfile your_merged --freq --out freq_check
- Inspect
freq_check.frq: How many variants total? Are MAFs sensible (e.g., not all 0 or 1 for your samples)? If your samples drive weird freqs, provide external freqs with --read-freq 1000g.frq (download from 1000G FTP) during --genome.
- If variant count is low (<10k-50k), that's your smoking gun—force overlap with
--extract common_snps.txt (make this from comm -12 <(sort 1000g.bim) <(sort your.bim)).
Check per-sample missingness:
plink --bfile your_merged --missing --out miss_check
- Look at
miss_check.lmiss (variant-level) and miss_check.imiss (sample-level). If your samples have >5-10% missing, filter them out (--mind 0.1) or investigate QC upstream. High missing in your cohort = NaNs downstream.
Rerun --genome with more details:
plink --bfile your_merged --genome full --min 0.05 --out genome_detailed
- The
full modifier adds IBS0/IBS1/IBS2/HOMHOM/HETHET columns. For NaN pairs, check if IBS0+IBS1+IBS2 = 0 (no overlap) or RATIO far from 2 (poor het/hom balance, e.g., strand issues).
--min 0.05 skips very low PI_HAT pairs (unrelated noise), but since yours are -1, it'll help filter junk.
- If IBS counts are zero/low for your pairs: Remerge with stricter variant matching. Use tools like
bcftools norm for strand/ref normalization, or liftOver (e.g., via CrossMap) if builds differ.
Handle merge pitfalls:
- Ensure same genome build: 1000G Phase 3 is hg19; confirm yours.
- Flip-scan for strand issues:
plink --bfile your_merged --flip-scan --out flips then manually flip suspect variants.
- LD-prune first:
plink --bfile your_merged --indep-pairwise 50 5 0.2 --out pruned then run --genome on the pruned set (reduces noise).
- If your data is VCF: Merge with
plink --vcf 1000g.vcf --vcf your.vcf --make-bed --out merged and --double-id if IDs overlap.
Once fixed, your .genome should populate DST (IBS distance) for all pairs—that's what you'd feed into R's cmdscale() for MDS (e.g., mds <- cmdscale(as.dist(1 - ibs_matrix), k=2) where ibs_matrix is from PI_HAT or DST, but DST is better for distance).
Better Approach for Ancestry MDS
Frankly, --genome isn't the best for this—it's optimized for close relatedness (PI_HAT >0.05), and distant ancestry signals get drowned in noise (PI_HAT ~0 everywhere, with pop structure biasing it). For inferring ancestry via MDS/PCA with 1000G as reference:
Use PLINK's --pca directly on the merged genotypes. It computes PCs via Eigendecomp on the GRM (genetic relationship matrix), handling missing data and pop structure robustly.
plink --bfile your_merged --pca 10 --out pca_results
- Outputs
pca_results.eigenvec (loadings) and pca_results.eigenval (eigenvalues). Plot PC1/PC2 colored by known 1000G superpops; project your samples onto it.
- Pre-prune:
--indep-pairwise 50 5 0.2 for ~100k variants (faster, less noisy).
- If overlap is still poor, subset to ancestry-informative markers (e.g., ~50k common SNPs from 1000G; see tutorials like EIGENSTRAT or ADMIXTURE prep).
Alternative: --distance 1d ibs square for an IBS matrix, then MDS in R. But --pca is simpler and standard (e.g., see 1000G PCA tutorials).
If you share more details (e.g., #variants pre/post-merge, your data type, build), I can refine this. Also, check the PLINK Google Group .
Kevin