Question

Which GWAS tool to use that can work on Czech cases and British controls, on a binary trait?

5 months ago

anita.szabo08 ▴ 10

Hi,

I am running a GWAS on 482 Czech cases and 1928 British controls for a binary trait (1:4 case-control ratio). Attached is their PCA plot (PC1 against PC2), which shows clear separation.

I ran "plink2 --glm" with various sets of covariates, checking how the lambda value and the number of significant SNPs change (attached as "gwas_summary.txt"). As one example, I am also attaching the log file from the run with PC1 + PC2, and the resulting QQ plot.
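For readers who want to reproduce the lambda value, here is a minimal sketch of the usual genomic inflation factor: the median observed 1-df chi-square statistic divided by its median under the null. It assumes the p-values have already been read into a plain Python list; the function name `genomic_inflation` and the toy p-values are hypothetical and are not taken from the attached results.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Return lambda_GC: median observed 1-df chi-square over its null median.

    Every p-value must be strictly greater than 0 and at most 1.
    """
    std_normal = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    # Using the lower tail (p / 2) keeps precision for very small p-values.
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square distribution under the null (about 0.455).
    null_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi2_stats) / null_median


# Toy input: evenly spaced p-values mimic a well-calibrated test,
# and squaring them mimics an inflated one.
n = 1001
uniform_p = [(i + 0.5) / n for i in range(n)]
inflated_p = [p ** 2 for p in uniform_p]

print("calibrated lambda:", genomic_inflation(uniform_p))
print("inflated lambda:", genomic_inflation(inflated_p))
```

A value close to 1 suggests little inflation; values well above 1 suggest confounding (for example population structure) or widespread true signal.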

If PC2 is included, then I receive the "UNFINISHED" error code for essentially all 5.8 million included SNPs.
The first few PCs seem important given the difference in EUR subpopulations between the cases and controls, but I do not get a usable result when they are included in the analysis.

I see that the Czech cohort is more homogeneous than the diverse British one, so SNP frequency differences could arise from population structure as well as from the disease.

What could be a solution that is computationally efficient? (Using the "logistf" R package with an increased number of iterations seems a good method for a subset of SNPs, but maybe not genome-wide?)
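As background on why Firth-type methods such as "logistf" come up here, the toy sketch below compares ordinary and Firth-penalized logistic log-likelihoods when a covariate perfectly separates cases from controls. It is a one-slope, no-intercept illustration with made-up covariate values, not the algorithm used by logistf or PLINK; all names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def loglik(beta, x, y):
    """Ordinary log-likelihood of the model P(y = 1) = sigmoid(beta * x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(beta * xi)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def firth_loglik(beta, x, y):
    """Log-likelihood plus Firth's penalty, 0.5 * log(Fisher information)."""
    info = 0.0
    for xi in x:
        p = sigmoid(beta * xi)
        info += xi * xi * p * (1.0 - p)
    return loglik(beta, x, y) + 0.5 * math.log(info)


# Hypothetical covariate (think of a PC) that perfectly separates the groups:
# every case has a positive value and every control a negative one.
x = [0.5, 0.8, 1.1, -0.4, -0.9, -1.3]
y = [1, 1, 1, 0, 0, 0]

# Candidate slopes 0.1, 0.2, ..., 20.0.
grid = [k / 10 for k in range(1, 201)]
best_plain = max(grid, key=lambda b: loglik(b, x, y))
best_firth = max(grid, key=lambda b: firth_loglik(b, x, y))

print("ordinary ML prefers the largest slope tried:", best_plain)
print("Firth-penalized fit peaks at a finite slope:", best_firth)
```

Under separation the ordinary likelihood keeps improving as the slope grows, so iterative fitting never settles, while the penalty term pulls the maximum back to a finite value. Whether this makes a genome-wide run practical is a separate question of cost per variant.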

Would you recommend another GWAS tool to use?

Best wishes, Anita

PLINK v2.0.0-a.6.12LM AVX2 AMD (20 Apr 2025)
Options in effect:
  --ci 0.95
  --covar /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/Czech/czech_age_sex_10_PC.txt
  --covar-name PC1 PC2
  --extract-if-info R2 >= 0.8
  --glm hide-covar firth-fallback
  --keep /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/Czech/matched_controls_czech_cases.txt
  --out /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/output/czech_pca_czech_cohort_matched_controls/pc1.2/output/gwas_results
  --pfile /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/Czech/main_cohort
  --pheno /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/phenotypes.txt
  --pheno-name PHENO
  --remove /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/0.samples_info/remove_samples.txt
  --seed 1
  --threads 1

Hostname: overdrive
Working directory: /media/pontikos_nas2/AnitaSzabo/phd_projects/overdrive_scripts/temp_phd/gwas_2024/genotype_to_gwas_scripts
Start time: Sun May 25 20:51:36 2025

257622 MiB RAM detected, ~218229 available; reserving 128811 MiB for main
workspace.
Using 1 compute thread.
16416 samples (0 females, 0 males, 16416 ambiguous; 16416 founders) loaded from
/media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/Czech/main_cohort.psam.
5800849 out of 5818824 variants loaded from
/media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/Czech/main_cohort.pvar.
1 binary phenotype loaded (482 cases, 15934 controls).
--keep: 2410 samples remaining.
--remove: 2410 samples remaining.
2 covariates loaded from /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/QC/Czech/czech_age_sex_10_PC.txt.
2410 samples (0 females, 0 males, 2410 ambiguous; 2410 founders) remaining
after main filters.
482 cases and 1928 controls remaining after main filters.
Calculating allele frequencies... done.
5800849 variants remaining after main filters.
--glm logistic-Firth hybrid regression on phenotype 'PHENO': done.
Results written to /media/pontikos_nas2/AnitaSzabo/phd_projects/gwas_2024/analysis/5.gwas/pipeline_output/without_high_LD_regionin_PCA/output/czech_pca_czech_cohort_matched_controls/pc1.2/output/gwas_results.PHENO.glm.logistic.hybrid .

End time: Sun May 25 21:52:23 2025

Attachments: PCA plot, QQ plot, covariates

plink2 gwas saige • 878 views

updated 5 months ago by LChart 5.1k • written 5 months ago by anita.szabo08 ▴ 10

score 4 · Answer 1 · 2025-05-29

There is really no way for you to do this analysis. Because cases and controls are perfectly confounded with population, every variant whose allele frequency differs between the two populations will appear to be associated with the trait. You need to add British cases and Czech controls, full stop. No software package will fix this.
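To make the confounding concrete, here is a rough sketch using the sample sizes from the question (482 cases, 1928 controls) and a variant with no effect on the trait at all. The allele frequencies of 0.30 and 0.40 are invented purely for illustration, and the counts are expected values rather than simulated genotypes.

```python
import math


def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    table = [
        [case_alt, case_total - case_alt],
        [ctrl_alt, ctrl_total - ctrl_alt],
    ]
    n = case_total + ctrl_total
    row_sums = [case_total, ctrl_total]
    col_sums = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def chi2_pvalue_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Two alleles per person, sample sizes as in the question.
case_alleles = 2 * 482
ctrl_alleles = 2 * 1928

# Hypothetical null variant: frequency 0.30 in the population all cases come
# from and 0.40 in the population all controls come from.
chi2 = allelic_chi2(0.30 * case_alleles, case_alleles,
                    0.40 * ctrl_alleles, ctrl_alleles)
print("chi-square:", chi2)
print("p-value:", chi2_pvalue_1df(chi2))
```

With these made-up frequencies the test passes the conventional 5e-8 genome-wide threshold even though the variant does nothing. Nothing in the data can tell this apart from a real effect, because case status and population are the same variable.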