I'm new to GWAS analysis. While reading through the QC documentation for the UK Biobank project, I came across this section on adjusting for population substructure before flagging samples with unusual heterozygosity:
A4 Accounting for the heterozygosity bias explained by population structure
Heterozygosity (computed from either autosomal or X-chromosome SNPs) is sensitive to population structure because of ascertainment bias: a majority of SNPs on the UK Biobank Axiom array were chosen to satisfy certain properties (imputation coverage, for example) in European populations. Here we describe the details of a regression model to adjust heterozygosity by accounting for the effects of population structure.
Let h denote the heterozygosity and let x be a set of features correlated with ancestry.
We used the projections onto the four major UK Biobank principal components to characterise ancestry, writing x = (x1, x2, x3, x4) for these four principal component values. Consider the following model for heterozygosity under population structure:
h(x) = h0 + β(x)
h(x) is the raw heterozygosity, which depends on the features x, h0 is the ancestry-adjusted heterozygosity and β(x) is a bias term due to population structure.
We chose a quadratic form for β(x), which includes all linear and quadratic terms xi and xi^2 as well as all cross terms xixj, and we estimated h0 with ordinary least squares. More specifically, the bias was assumed to have the following functional form:
β(x) = β11x1^2 + β22x2^2 + β33x3^2 + β44x4^2 + β1x1 + β2x2 + β3x3 + β4x4 + β12x1x2 + β13x1x3 + β14x1x4 + β23x2x3 + β24x2x4 + β34x3x4
The fitted value ĥ0 is the ancestry-corrected heterozygosity.
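One way to read the model above, sketched in Python on synthetic data. Everything here is an assumption for illustration: the function names `quadratic_design` and `adjust_heterozygosity`, the simulated PCs and heterozygosity values, and in particular the reading that each sample's adjusted value is its raw heterozygosity minus the estimated bias term (equivalently, the intercept plus that sample's residual). It is not the UK Biobank pipeline.

```python
import numpy as np
from itertools import combinations


def quadratic_design(pcs):
    """Build columns [1, x_i, x_i^2, x_i*x_j (i<j)] from an (n, k) PC matrix."""
    n, k = pcs.shape
    cols = [np.ones(n)]                                   # intercept (h0)
    cols += [pcs[:, i] for i in range(k)]                 # linear terms
    cols += [pcs[:, i] ** 2 for i in range(k)]            # squared terms
    cols += [pcs[:, i] * pcs[:, j]
             for i, j in combinations(range(k), 2)]       # cross terms
    return np.column_stack(cols)


def adjust_heterozygosity(h, pcs):
    """Fit h = h0 + beta(x) by OLS and subtract the estimated beta(x)."""
    X = quadratic_design(pcs)
    coef, *_ = np.linalg.lstsq(X, h, rcond=None)
    bias = X[:, 1:] @ coef[1:]     # estimated beta(x) for each sample
    h_adj = h - bias               # assumed reading: intercept + residual
    return h_adj, coef


# Synthetic example (made-up numbers, not real genotype data).
rng = np.random.default_rng(0)
n = 2000
pcs = rng.normal(size=(n, 4))
true_bias = (0.01 * pcs[:, 0]
             - 0.005 * pcs[:, 1] ** 2
             + 0.002 * pcs[:, 0] * pcs[:, 2])
h = 0.19 + true_bias + rng.normal(scale=0.003, size=n)

h_adj, coef = adjust_heterozygosity(h, pcs)
print("spread before:", h.std(), "after:", h_adj.std())
```

Under this reading, the adjusted values keep the per-sample variation that the PCs cannot explain, so outliers in `h_adj` would be samples whose heterozygosity is unusual for their ancestry rather than unusual overall.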
As I understand it, they run a PCA, keep the PCs that capture the bulk of the population variance, and then regress the per-sample heterozygosity (the outcome) on those PCs (the predictors, together with their squares and pairwise products). The fitted values from the model are then called the ancestry-corrected heterozygosity values.
Coming from an engineering background, I learned regression from a machine learning perspective, as a tool for producing predictions. Is there statistical reasoning behind this approach, or is it just a neat hack?