Question

MDS file NaN values

0

Entering edit mode

3 months ago

am29 ▴ 60

I am trying to produce an MDS plot of 1000 genomes individuals and individuals from my dataset to infer the ancestry of individuals in my dataset. However, after merging the files and running --genome command in PLINK, my .genome file is full of NaN values, and looks like this:

FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO    
0   1000g_sample_1 0 Mydataset_sample_1 OT 0 nan nan nan nan -1 nan NA NA 0

When examining the genome file, distances are calculated for individuals from 1000 genomes, but none are calculated for individuals in my dataset.

What could be the reason/solution for this?

missing genome MDS • 485 views

ADD COMMENT • link updated 7 hours ago by Kevin Blighe 89k • written 3 months ago by am29 ▴ 60

score 0 · Answer 1 · 2025-11-07

Hello am29,

I've seen this kind of issue crop up a few times when merging datasets for population structure analyses, especially with reference panels like 1000 Genomes. The NaN values in Z0/Z1/Z2 and the -1 in PI_HAT for pairs involving your dataset samples (while 1000G pairs look fine) typically indicate that PLINK couldn't compute the IBD/IBS estimates for those specific pairs. This isn't a bug per se, but usually points to data quality or compatibility problems post-merge. I'll break down the likely causes and solutions below, then suggest a better workflow for your ancestry MDS goal.

Likely Reasons for the NaNs/-1

Based on PLINK's --genome mechanics (which relies on method-of-moments estimation of IBD probabilities from IBS sharing across autosomal SNPs):

Insufficient overlapping/informative SNPs for those pairs: If your dataset and 1000G don't share many variants with called genotypes in both samples of the pair, the computation fails. PLINK needs a reasonable number of non-missing, heterozygous/heterozygous or homozygous/homozygous comparisons to estimate Z0/Z1/Z2 (probabilities of sharing 0/1/2 alleles IBD). With too few (e.g., <100-500 effective SNPs), it defaults to NaN/-1. Your 1000G pairs work because they share the full variant set; your samples likely have sparse overlap.
High missingness in your samples: If your dataset has low genotyping coverage (e.g., targeted/exome data) or poor call rates at 1000G sites, entire pairs could have zero usable SNPs.
Allele frequency estimation issues: --genome uses empirical MAFs from the merged dataset to adjust for population structure. If your samples are from a different ancestry (or your dataset is small), this biases the freqs, leading to unstable estimates. (PI_HAT defaults to -1 when the estimator can't converge.)
Merge artifacts: Strand flips, ref/alt mismatches, or genome build differences (e.g., hg19 vs. hg38) can make variants "non-informative" after merge, effectively creating missing data.
Less common: Pedigree info in .fam causing founder-only filtering, but that usually affects all pairs uniformly.

The "OT" in RT is odd (should be a number like 0 for unrelated)—might be a placeholder or parse error, but not the core issue.

Quick Diagnostics and Fixes

Run these on your merged dataset (assuming it's in binary format: plink --bfile your_merged --...):

Check overall variant overlap and MAFs:
```
plink --bfile your_merged --freq --out freq_check
```
- Inspect freq_check.frq: How many variants total? Are MAFs sensible (e.g., not all 0 or 1 for your samples)? If your samples drive weird freqs, provide external freqs with --read-freq 1000g.frq (download from 1000G FTP) during --genome.
- If variant count is low (<10k-50k), that's your smoking gun—force overlap with --extract common_snps.txt (make this from comm -12 <(sort 1000g.bim) <(sort your.bim)).
Check per-sample missingness:
```
plink --bfile your_merged --missing --out miss_check
```
- Look at miss_check.lmiss (variant-level) and miss_check.imiss (sample-level). If your samples have >5-10% missing, filter them out (--mind 0.1) or investigate QC upstream. High missing in your cohort = NaNs downstream.
Rerun --genome with more details:
```
plink --bfile your_merged --genome full --min 0.05 --out genome_detailed
```
- The full modifier adds IBS0/IBS1/IBS2/HOMHOM/HETHET columns. For NaN pairs, check if IBS0+IBS1+IBS2 = 0 (no overlap) or RATIO far from 2 (poor het/hom balance, e.g., strand issues).
- --min 0.05 skips very low PI_HAT pairs (unrelated noise), but since yours are -1, it'll help filter junk.
- If IBS counts are zero/low for your pairs: Remerge with stricter variant matching. Use tools like bcftools norm for strand/ref normalization, or liftOver (e.g., via CrossMap) if builds differ.
Handle merge pitfalls:
- Ensure same genome build: 1000G Phase 3 is hg19; confirm yours.
- Flip-scan for strand issues: plink --bfile your_merged --flip-scan --out flips then manually flip suspect variants.
- LD-prune first: plink --bfile your_merged --indep-pairwise 50 5 0.2 --out pruned then run --genome on the pruned set (reduces noise).
- If your data is VCF: Merge with plink --vcf 1000g.vcf --vcf your.vcf --make-bed --out merged and --double-id if IDs overlap.

Once fixed, your .genome should populate DST (IBS distance) for all pairs—that's what you'd feed into R's cmdscale() for MDS (e.g., mds <- cmdscale(as.dist(1 - ibs_matrix), k=2) where ibs_matrix is from PI_HAT or DST, but DST is better for distance).

Better Approach for Ancestry MDS

Frankly, --genome isn't the best for this—it's optimized for close relatedness (PI_HAT >0.05), and distant ancestry signals get drowned in noise (PI_HAT ~0 everywhere, with pop structure biasing it). For inferring ancestry via MDS/PCA with 1000G as reference:

Use PLINK's --pca directly on the merged genotypes. It computes PCs via Eigendecomp on the GRM (genetic relationship matrix), handling missing data and pop structure robustly.
```
plink --bfile your_merged --pca 10 --out pca_results
```
- Outputs pca_results.eigenvec (loadings) and pca_results.eigenval (eigenvalues). Plot PC1/PC2 colored by known 1000G superpops; project your samples onto it.
- Pre-prune: --indep-pairwise 50 5 0.2 for ~100k variants (faster, less noisy).
- If overlap is still poor, subset to ancestry-informative markers (e.g., ~50k common SNPs from 1000G; see tutorials like EIGENSTRAT or ADMIXTURE prep).
Alternative: --distance 1d ibs square for an IBS matrix, then MDS in R. But --pca is simpler and standard (e.g., see 1000G PCA tutorials).

If you share more details (e.g., #variants pre/post-merge, your data type, build), I can refine this. Also, check the PLINK Google Group .

Kevin