Question

How to troubleshoot strange hyperparameter trace from BSLMM (GEMMA)

2

Entering edit mode

10 weeks ago

gina.maria ▴ 60

I am using GEMMA to run a BSLMM to test for SNP associations with a phenotype. I ran 10 independent runs of the model and checked the traces of PVE for convergence. I expected the trace to look like a 'caterpillar' centered around a median value, but instead the traces all hang from PVE = 1. Some runs get stuck around 1 for a bit (see top right panel in image below), but most vascillate from 1 across most of the parameter space throughout the chain. example traces of PVE from 4 runs

I'm not sure where to start troubleshooting this problem, I didn't find any pointers in the GEMMA documentation, the GEMMA github, or a google search. These models take a ton of computation time to fit, so any advice on how to effectively troubleshoot is very appreciated! Some additional info: my sample size is 136 individuals with around 5 million SNPs, I ran the MCMC for 25 million iterations with 5million discarded as burn-in. Thank you for your help!

gemma gwas bslmm • 2.5k views

ADD COMMENT • link 10 weeks ago by gina.maria ▴ 60

score 2 · Accepted Answer · 2025-08-14

2

Entering edit mode

10 weeks ago

Aleksandra ▴ 190

The observed PVE trace, which repeatedly hits its upper boundary at 1.0, is a classic indicator of severe model misspecification rather than a failure of MCMC convergence to its stationary distribution. This behavior is most commonly induced by massive confounding from uncontrolled population structure, a problem that is severely exacerbated when a small sample size (N=136) is confronted with a high marker density (p=5M). Your immediate priority must be to explicitly model this structure by incorporating the top genetic principal components as covariates. Additionally, you must verify that stringent quality control has been applied to your SNP data and perform an inverse normal transformation on your phenotype if it deviates significantly from normality, as extending MCMC iterations cannot fix a model that fundamentally violates its assumptions.

ADD COMMENT • link 10 weeks ago by Aleksandra ▴ 190

2

Entering edit mode

Thank you very much for the answer, this is very helpful! Regarding the PCA I have a couple questions:

I used BEAGLE to perform imputations for missing genotypes, I'm assuming it be more appropriate to perform the PCA on the imputed data (vs. the original data)?

Does a genomic PCA does a better job accounting for population structure than the relatedness matrix calculated by gemma -bfile (or, at least in the case of small sample sizes/highly structured populations)?

ADD REPLY • link 10 weeks ago by gina.maria ▴ 60

1

Entering edit mode

Excellent follow-up questions. Regarding your first point, you should definitively perform the PCA on the post-imputation dataset. Because principal components are derived from the covariance structure of the genotype matrix, using the complete, imputed data will yield more robust and reliable eigenvectors that accurately reflect population stratification rather than technical artifacts arising from missing data patterns. Concerning your second question, PCA and the Gemma-calculated genomic relatedness matrix (GRM) are not interchangeable; they are complementary components designed to control for different sources of confounding. The GRM models the polygenic background and cryptic relatedness as a random effect, while the top PCs are incorporated as fixed effects specifically to account for deep, systematic population structure, which is the likely cause of your PVE boundary issue. The standard, robust protocol is to employ both simultaneously.

ADD REPLY • link 10 weeks ago by Aleksandra ▴ 190

2

Entering edit mode

Thank you! I am now running into an issue with implementation. GEMMA does not allow covariate files for BSLMM (it does for LMM). I found a section of the manual that states

"GEMMA does not take covariates file when fitting BSLMM. However, one can use the BIMBAM mean genotype file to store these covariates and use “-notsnp” option to use them." --pg. 24

I have been using the PLINK format for my SNP and phenotype data and had prepared the PC covariates as a .bam file, analogous to the phenotype data (one row per individual, with PC eigenvectors in each of 10 columns for the top 10 PCs.) I am a bit confused about how the PC eigenvector can be converted to BIMBAM mean genotype file, as this file format has rows for each SNP x individual, including the SNP ID, minor and major alleles as columns. So I'm not sure how I can convert phenotype data into this format or whether that would be appropriate.

ADD REPLY • link 10 weeks ago by gina.maria ▴ 60