Clumping error Plink
8 months ago
Agamemnon ▴ 40

I have been trying to develop a genetic risk score (GRS).

All per-chromosome files from the UK Biobank were merged into a single merged .bed file. The filters were as follows:

--maf 0.01 \
--hwe 1e-6 \
--geno 0.1 \

I then ran clumping, which produced this log:

PLINK v1.90p 64-bit (8 Nov 2021)
Options in effect:
  --bed chr_merged.bed
  --bim chr_merged.bim
  --clump park_updated.score
  --clump-field P
  --clump-kb 250
  --clump-p1 1
  --clump-r2 0.1
  --clump-snp-field SNP
  --fam chr1.fam
  --out chr.qc
  --threads 64

1031886 MB RAM detected; reserving 515943 MB for main workspace.
4113097 variants loaded from .bim file.
487409 people (223038 males, 264368 females, 3 ambiguous) loaded from .fam.
Ambiguous sex IDs written to chr.qc.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 487409 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.99122.
4113097 variants and 487409 people pass filters and QC.
Note: No phenotypes present.

I got the following message:

Warning: 'rs356203' is missing from the main dataset, and is a top variant.
Warning: 'rs356219' is missing from the main dataset, and is a top variant.
Warning: 'rs356215' is missing from the main dataset, and is a top variant.
2357669 more top variant IDs missing; see log file.

I posted on the PLINK forum and was told that my SNPs are not in sync, but I don't understand what I have done wrong here.

clumping plink biobank uk • 722 views
8 months ago
Sam ★ 4.5k

In short, the variants named in your warning messages cannot be found in your genotype data. This isn't too surprising, as the non-imputed version of the UK Biobank genotype data usually does not overlap much with traditional GWAS data (e.g. only ~30% SNP overlap with the 2013 GIANT data).
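To put a number on that overlap, here is a minimal sketch (not PLINK's own logic) that counts how many IDs from the score file appear in the .bim file. It assumes the score file is whitespace-delimited with a header row containing an `SNP` column, as implied by `--clump-snp-field SNP`; the function names are made up for illustration.

```python
def read_bim_ids(bim_path):
    """Return the set of variant IDs (2nd column) from a PLINK .bim file."""
    ids = set()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                ids.add(fields[1])
    return ids


def read_score_ids(score_path, snp_field="SNP"):
    """Return the IDs in the named column of a headered, whitespace-delimited file.

    Assumption: the first line is a header that contains `snp_field`.
    """
    with open(score_path) as handle:
        header = handle.readline().split()
        col = header.index(snp_field)  # raises ValueError if the column is absent
        ids = []
        for line in handle:
            fields = line.split()
            if len(fields) > col:
                ids.append(fields[col])
        return ids


def coverage(bim_path, score_path, snp_field="SNP"):
    """Return (found, total): score IDs present in the .bim, and all score IDs."""
    bim_ids = read_bim_ids(bim_path)
    score_ids = read_score_ids(score_path, snp_field)
    found = sum(1 for snp in score_ids if snp in bim_ids)
    return found, len(score_ids)
```

With the files from the log above, this would be called as `coverage("chr_merged.bim", "park_updated.score")`. A very low `found` relative to `total` points to an ID mismatch rather than a clumping problem.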

Many thanks. That's a bit strange though, because the original .bgen files that I filtered were imputed.

Even with imputed data you will still get some mismatches, because you are unlikely to have full coverage of every single SNP. That said, for imputed data the number of missing IDs is rather high, so I would definitely check your .bim file and look at the coverage.
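One way to check the .bim file is to tally what its variant IDs look like. If the score file uses rsIDs but the .bim mostly holds chr:pos style IDs, that would explain this many missing IDs. This is only a guess, since nothing in the thread confirms the naming scheme. A rough sketch follows; the categories and regular expressions are my own assumptions.

```python
import re
from collections import Counter


def classify_id(variant_id):
    """Put a variant ID into a coarse naming-style bucket (illustrative only)."""
    if variant_id == ".":
        return "missing"
    if re.fullmatch(r"rs\d+", variant_id):
        return "rsid"
    # e.g. 4:90626111, chr4:90626111 or 4:90626111_A_G
    if re.fullmatch(r"(chr)?[0-9XYMT]+:\d+.*", variant_id, flags=re.IGNORECASE):
        return "chr:pos"
    return "other"


def tally_bim_id_styles(bim_path):
    """Count the naming styles in the variant ID column (2nd) of a .bim file."""
    counts = Counter()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                counts[classify_id(fields[1])] += 1
    return counts
```

Calling `tally_bim_id_styles("chr_merged.bim")` returns a `Counter` of styles. If "rsid" is not the dominant bucket, the IDs would need to be harmonised with the score file before clumping.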

