I am using GEMMA to run a BSLMM to test for SNP associations with a phenotype. I ran 10 independent runs of the model and checked the traces of PVE for convergence. I expected the trace to look like a 'caterpillar' centered around a median value, but instead the traces all hang from PVE = 1. Some runs get stuck around 1 for a bit (see top right panel in image below), but most vascillate from 1 across most of the parameter space throughout the chain.
I'm not sure where to start troubleshooting this problem, I didn't find any pointers in the GEMMA documentation, the GEMMA github, or a google search. These models take a ton of computation time to fit, so any advice on how to effectively troubleshoot is very appreciated! Some additional info: my sample size is 136 individuals with around 5 million SNPs, I ran the MCMC for 25 million iterations with 5million discarded as burn-in. Thank you for your help!
Thank you very much for the answer, this is very helpful! Regarding the PCA I have a couple questions:
I used BEAGLE to perform imputations for missing genotypes, I'm assuming it be more appropriate to perform the PCA on the imputed data (vs. the original data)?
Does a genomic PCA does a better job accounting for population structure than the relatedness matrix calculated by gemma -bfile (or, at least in the case of small sample sizes/highly structured populations)?
Excellent follow-up questions. Regarding your first point, you should definitively perform the PCA on the post-imputation dataset. Because principal components are derived from the covariance structure of the genotype matrix, using the complete, imputed data will yield more robust and reliable eigenvectors that accurately reflect population stratification rather than technical artifacts arising from missing data patterns. Concerning your second question, PCA and the Gemma-calculated genomic relatedness matrix (GRM) are not interchangeable; they are complementary components designed to control for different sources of confounding. The GRM models the polygenic background and cryptic relatedness as a random effect, while the top PCs are incorporated as fixed effects specifically to account for deep, systematic population structure, which is the likely cause of your PVE boundary issue. The standard, robust protocol is to employ both simultaneously.
Thank you! I am now running into an issue with implementation. GEMMA does not allow covariate files for BSLMM (it does for LMM). I found a section of the manual that states
I have been using the PLINK format for my SNP and phenotype data and had prepared the PC covariates as a .bam file, analogous to the phenotype data (one row per individual, with PC eigenvectors in each of 10 columns for the top 10 PCs.) I am a bit confused about how the PC eigenvector can be converted to BIMBAM mean genotype file, as this file format has rows for each SNP x individual, including the SNP ID, minor and major alleles as columns. So I'm not sure how I can convert phenotype data into this format or whether that would be appropriate.