Question

Highly inflated p-values in GWAS by regenie

7 months ago

cwwong13 ▴ 40

I ran a GWAS with REGENIE 3.2.5 on more than 250,000 samples, and the p-values it returned look highly inflated, with -log10(P) values up to 5000. As a result, over 10,000 variants were significant at the threshold of p < 5e-8, far more than in previous studies, so I suspect the p-values are inflated for some unknown reason. I briefly checked the GitHub repository of REGENIE (this, and this); the issue of inflated p-values has been reported there, but without a satisfying answer.

I ran the same code on other, smaller groups of samples from the same dataset, and the results were closer to expectation, with around 200 significant associations found.

Below is my pseudocode. Any suggestions or advice would be appreciated. Thank you!

plink2 \
  --bfile bfile \
  --mac 100 --geno 0.1 --hwe 1e-15 \
  --mind 0.1 \
  --keep eid.txt \
  --write-snplist --write-samples --no-id-header \
  --out qc_pass

# Total genotyping rate is 0.969388.
# 784256 variants and 488377 people pass filters and QC.

plink2 \
  --bgen chr${chr}.bgen ref-first \
  --sample chr${chr}.sample \
  --keep eid.txt \
  --mind 0.1 --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-5 \
  --export bgen-1.2 --out QCed_chr${chr} \
  --memory 8000 require

# A total of 9255791 variants from chromosomes 1-22 and 259386 samples remained after filter

# Step 1: fit the whole-genome regression (null) model on the QC-passed array variants
regenie \
  --step 1 \
  --bed ${PLINK_DATA_PREFIX} \
  --phenoFile $PHENO_FILE \
  --extract qc_pass.snplist \
  --bsize 1000 \
  --niter 30 \
  --threads 16 \
  --lowmem \
  --lowmem-prefix ${TEMP_DIR}/pred \
  --out step1

# Fitting null model
#  * bim              : [bfile.bim] n_snps = 784256
#    -keeping variants specified by --extract
#    -number of variants remaining in the analysis = 589385
#    -keeping and mean-imputing missing observations (done for each trait)
#    -number of phenotyped individuals  = 258874
#  * number of individuals used in analysis = 258874

# Step 2: association testing on the imputed variants, using the step 1 predictions
regenie \
  --step 2 \
  --bgen QCed_chr${chr}.bgen \
  --sample QCed_chr${chr}.sample \
  --ref-first \
  --phenoFile $PHENO_FILE \
  --chr ${chr} \
  --pred $STEP1_PRED_FILE \
  --bsize 400 \
  --threads 8 \
  --gz \
  --out step2_chr${chr}

# Association testing mode with multithreading using OpenMP
#  * bgen             : [QCed_chr1.bgen]
#    -summary : bgen file (v1.2 layout, zlib compressed) with 259386 named samples and 715235 variants with 16-bit encoding.
#    -keeping variants specified by --extract
#    -sample file: QCed_chr1.sample
#    -keeping only individuals specified by --keep
#  * phenotypes       : [phenofile] n_pheno = 10
#    -number of phenotyped individuals  = 258874
#  * number of individuals used in analysis = 258874
#  * # threads        : [8]
#  * block size       : [400]
#  * # blocks         : [1787]
#  * approximate memory usage : 2GB
#  * using minimum MAC of 5 (variants with lower MAC are ignored)
#  * user specified to test only on select chromosomes

regenie plink gwas • 1.2k views

ADD COMMENT • link 7 months ago by cwwong13 ▴ 40

Hi! Just jumping in with a suggestion rather than an answer: could it be that the p-values you're getting are not adjusted for multiple comparisons?

ADD REPLY • link 7 months ago by SushiRoll ▴ 120

"with a suggestion rather than an answer"

Yet you chose to add an answer rather than a comment. I've moved it to a comment now, please be more mindful in the future.

ADD REPLY • link 7 months ago by Ram 43k

Thanks, SushiRoll. I was told to use the standard GWAS significance cutoff of 5e-8. I know this might not be conservative enough. I am also unsure whether I should compare the raw p-value or the FDR-corrected q-value against the 5e-8 cutoff.

ADD REPLY • link 7 months ago by cwwong13 ▴ 40
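On the threshold question: 5e-8 is roughly a Bonferroni correction of 0.05 for about one million independent common-variant tests, so it is the raw p-value that gets compared against it, not a q-value. Since regenie reports LOG10P, the comparison can be done on the log scale, where the cutoff is -log10(5e-8) ≈ 7.30103 and values like 5000 do not underflow. A small sketch, assuming a whitespace-delimited results file with a header row that contains a LOG10P column; the function name count_gws is made up for illustration.

```shell
# count_gws FILE: count rows whose LOG10P exceeds -log10(5e-8) = 7.30103.
# Assumes a header row naming a LOG10P column; rows with NA are skipped.
count_gws() {
  awk -v thr=7.30103 '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "LOG10P") col = i; next }
    col && $col != "NA" && $col + 0 > thr { n++ }
    END { print n + 0 }' "$1"
}
```

For gzipped output this could be called as `gunzip -c results.regenie.gz | count_gws /dev/stdin`, where the file name is a placeholder. Note that this counts variants, not independent loci, so many of the hits may be in LD with each other.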

score 0 · Answer 1 · 2023-09-22

7 months ago

LChart 3.9k

You don't appear to be specifying a covariate file anywhere. Are the other groups also excluding potentially confounding information like sex, self-reported ethnicity, and age? Have you looked for and excluded relatives? A random-effect model like REGENIE is good, but it's not perfect, and may only partially remove effects of population or sex stratification, particularly if the phenotype distribution has different moments in different populations.

ADD COMMENT • link 7 months ago by LChart 3.9k

Thanks, LChart, and sorry for the confusion. Covariates such as age and sex were adjusted for when the phenotypes were pre-processed, and the data are from a single ethnicity with related individuals excluded (leaving only the individuals in eid.txt). The phenotype files contain inverse-normal-transformed trait residuals.

ADD REPLY • link 7 months ago by cwwong13 ▴ 40

You still definitely need to include principal components as covariates, even if your data are from a single ethnicity. Leaving them out will almost certainly contribute to p-value inflation.

ADD REPLY • link 7 months ago by 4galaxy77 2.8k
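For reference, covariates can be handed to regenie directly with --covarFile instead of (or in addition to) being regressed out of the phenotype beforehand; regenie then projects them out of both the phenotypes and the genotypes. A hedged sketch of the step 2 call from the question with a covariate file added: COVAR_FILE is a hypothetical variable pointing to a whitespace-delimited file with FID and IID columns followed by the covariates (for example age, sex and the PCs), and the wrapper function name is made up for illustration. The same --covarFile would also be passed in step 1.

```shell
# step2_with_covars CHR: the step 2 command from the question plus --covarFile.
# Assumes PHENO_FILE, COVAR_FILE (hypothetical) and STEP1_PRED_FILE are set.
step2_with_covars() {
  chr=$1
  regenie \
    --step 2 \
    --bgen "QCed_chr${chr}.bgen" \
    --sample "QCed_chr${chr}.sample" \
    --ref-first \
    --phenoFile "$PHENO_FILE" \
    --covarFile "$COVAR_FILE" \
    --chr "$chr" \
    --pred "$STEP1_PRED_FILE" \
    --bsize 400 \
    --threads 8 \
    --gz \
    --out "step2_chr${chr}"
}
```

Whether this changes the inflation here is an open question; it only removes one difference between this run and a standard regenie setup.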

Thanks, 4galaxy77. The first 10 PCs were included when calculating the trait residuals, but the results were still highly inflated.

ADD REPLY • link 7 months ago by cwwong13 ▴ 40
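One way to put a number on "highly inflated" is the genomic inflation factor, lambda GC: the median observed chi-square statistic divided by 0.4549364, the median of a 1-df chi-square under the null. A rough sketch, assuming the step 2 results are whitespace-delimited with a header row that contains a CHISQ column; the function name lambda_gc is made up for illustration. Working from CHISQ rather than from p-values avoids underflow when -log10(P) runs into the thousands.

```shell
# lambda_gc FILE: median(CHISQ) / 0.4549364 (median of a 1-df chi-square).
# Assumes a header row naming a CHISQ column; rows with NA are skipped.
lambda_gc() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "CHISQ") col = i; next }
       col && $col != "NA" { print $col }' "$1" |
    sort -g |
    awk '{ v[NR] = $1 }
         END {
           if (NR == 0) { print "NA"; exit }
           med = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
           printf "%.3f\n", med / 0.4549364
         }'
}
```

For gzipped output this could be called as `gunzip -c results.regenie.gz | lambda_gc /dev/stdin`, where the file name is a placeholder. A value close to 1 suggests little genome-wide inflation; a value well above 1 can reflect confounding, but also genuine polygenic signal at large sample sizes, so lambda GC alone does not settle which one it is.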