Hello! I'm trying to understand what the best algorithm for GWAS is nowadays. I know we have many tools available, like Plink and Hail, but what is the best algorithm if I won't use any of them? Let's say I write a script in R or Python from scratch: which statistical algorithm should I use? Is it linear mixed models (LMMs)? I'm confused because we can have binary phenotypes (case/control) or quantitative phenotypes. LMMs seem to address quantitative ones, but can they be used for case/control as well? What is the state of the art for each of them? Peer-reviewed papers as references would be appreciated. Thanks!
The main regression run by Plink (per-variant regression with the top principal components as covariates to correct for population stratification) was introduced by EIGENSTRAT in ~2006; see https://www.nature.com/articles/ng1847 . This is straightforward to write in R/Python from scratch; the harder part is optimizing the implementation for large datasets.
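To make the "straightforward from scratch" point concrete, here is a minimal Python sketch of a per-variant linear regression with principal components as covariates, for a quantitative phenotype. It is an illustration under stated assumptions, not Plink's or EIGENSTRAT's actual code: the function names `top_pcs` and `linear_gwas` are made up for this example, genotypes are assumed to be a dense, fully observed 0/1/2 matrix that fits in memory, and p-values use a normal approximation to the t distribution.

```python
import math
import numpy as np

def top_pcs(G, k):
    """Top-k sample principal components from an (n_samples, n_variants)
    matrix of 0/1/2 allele counts (hypothetical helper, no missing data)."""
    p = G.mean(axis=0) / 2.0                      # allele frequency per variant
    keep = (p > 0) & (p < 1)                      # drop monomorphic variants
    # Standardize each variant by its binomial mean and standard deviation.
    Z = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]

def linear_gwas(G, y, covars):
    """Regress y on each column of G separately, adjusting for covars.

    Uses the Frisch-Waugh-Lovell trick: residualize y and all genotype
    columns on [intercept, covars] once, then each variant needs only dot
    products. Assumes no variant is monomorphic or collinear with covars."""
    n = len(y)
    X = np.column_stack([np.ones(n), covars])     # intercept + covariates
    Q, _ = np.linalg.qr(X)                        # orthonormal basis of X
    y_r = y - Q @ (Q.T @ y)                       # residualized phenotype
    G_r = G - Q @ (Q.T @ G)                       # residualized genotypes
    gg = np.einsum("ij,ij->j", G_r, G_r)          # per-variant sum of squares
    gy = G_r.T @ y_r
    beta = gy / gg                                # effect size per variant
    dof = n - X.shape[1] - 1                      # minus covariates and genotype
    sigma2 = (y_r @ y_r - beta * gy) / dof        # residual variance per variant
    se = np.sqrt(sigma2 / gg)
    # Two-sided p-value, normal approximation (reasonable for large n only;
    # a Student's t distribution with `dof` degrees of freedom is exact).
    pvals = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in beta / se])
    return beta, se, pvals
```

A call would look like `beta, se, pvals = linear_gwas(G, y, top_pcs(G, 10))`, where the choice of 10 components is only an example. For real data the same idea has to be applied in chunks of variants, which is where most of the engineering effort goes.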
The Firth regression added to Plink 2.0 to improve handling of rare variants and imbalanced binary phenotypes was motivated by https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4049324/ .
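For readers who want to see what Firth regression does, here is a minimal Python sketch of Firth-penalized logistic regression for a single design matrix (for example intercept, covariates, and one variant). It follows the standard modified-score formulation and is not Plink 2.0's implementation; the function name `firth_logistic`, the step cap of 5, and the iteration limits are assumptions made for this example.

```python
import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Firth-penalized logistic regression (hypothetical helper).

    X: (n, k) design matrix that already includes an intercept column.
    y: (n,) array of 0/1 outcomes.
    Returns the coefficient estimates and Wald standard errors."""

    def fit_parts(beta):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # fitted probabilities
        w = p * (1.0 - p)                         # logistic weights
        cov = np.linalg.inv(X.T @ (X * w[:, None]))  # inverse Fisher information
        return p, w, cov

    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p, w, cov = fit_parts(beta)
        # Diagonal of the hat matrix: h_i = w_i * x_i^T (X^T W X)^{-1} x_i
        h = w * np.einsum("ij,jk,ik->i", X, cov, X)
        # Firth's modified score: the usual X^T (y - p) plus a bias-reducing term.
        score = X.T @ (y - p + h * (0.5 - p))
        step = cov @ score
        biggest = np.max(np.abs(step))
        if biggest > 5.0:                         # cap very large steps (arbitrary limit)
            step *= 5.0 / biggest
        beta = beta + step
        if biggest < tol:
            break
    _, _, cov = fit_parts(beta)
    return beta, np.sqrt(np.diag(cov))
```

Unlike ordinary logistic regression, this gives finite estimates even when a rare variant is seen only in cases or only in controls, which is the situation the penalty is meant to handle. The Wald standard errors returned here are the simplest option; a penalized likelihood-ratio test is often preferred for inference.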
Linear mixed models provide better statistical power when you have lots of close relatives in your dataset, but are much trickier to fit; this is still an active research area. Two tools covering parts of the current state of the art are SAIGE (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6119127/ ; handles imbalanced binary phenotypes, but relatively slow) and fastGWA (https://www.nature.com/articles/s41588-019-0530-8 ; great speed, but doesn't support dosage data yet and uses a misspecified model for binary phenotypes).