Hello,
I'm currently using the software "ADMIXTURE" to estimate the most probable number of genetic groups (K) in a large panel of 38 landraces (old local populations) using high-density, genome-wide SNP markers.
For the estimation of K, I ran the calculations for K = 1 to 65 and plotted the cross-validation errors for each model. Unfortunately, the CV error plot is ambiguous: the curve is rather flat from K = 40 onward, so there is no clear minimum. Since I included 38 landraces, I expected K = 38. However, the CV error keeps getting slightly lower (in the 3rd or 4th decimal place) with each increase in K. Maybe it is impossible to define a best K for such a diverse panel with so many different subpopulations.
I would be interested in the standard errors of the cross-validation error estimates. One should be able to calculate the variance and std. error, as it is a 5-fold cross validation. Is there any way to get an output for this CV std. error in ADMIXTURE?
I would then simply choose the most parsimonious model (lowest K) which is within the standard error of the best model (lowest CV error).
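To make the selection rule concrete, here is a minimal sketch of this "one standard error" rule in Python. It assumes that per-fold CV errors are available for each K, which I have not confirmed ADMIXTURE reports. The function name `one_se_rule` and all numbers are hypothetical placeholders, not results from my data. The folds are also not strictly independent, so the standard error is only approximate.

```python
import math
from statistics import mean, stdev

def one_se_rule(fold_errors):
    """Pick the smallest K whose mean CV error lies within one
    standard error of the lowest mean CV error.

    fold_errors: dict mapping K -> list of per-fold CV errors (assumed input).
    """
    summary = {}
    for k, errs in fold_errors.items():
        # Standard error of the mean across folds (sample std. dev. / sqrt(n)).
        se = stdev(errs) / math.sqrt(len(errs))
        summary[k] = (mean(errs), se)

    # Model with the lowest mean CV error.
    best_k = min(summary, key=lambda k: summary[k][0])
    best_mean, best_se = summary[best_k]

    # Most parsimonious model within one standard error of the best.
    threshold = best_mean + best_se
    chosen_k = min(k for k, (m, _) in summary.items() if m <= threshold)
    return chosen_k, best_k, summary

# Made-up per-fold errors for illustration only.
example = {
    36: [0.5120, 0.5135, 0.5110, 0.5128, 0.5122],
    38: [0.5100, 0.5111, 0.5094, 0.5107, 0.5103],
    40: [0.5099, 0.5111, 0.5093, 0.5106, 0.5101],
    42: [0.5098, 0.5110, 0.5092, 0.5105, 0.5100],
}

chosen_k, best_k, summary = one_se_rule(example)
print("lowest CV error at K =", best_k)
print("most parsimonious K within 1 SE =", chosen_k)
```

With a curve as flat as mine, the rule would probably pick a K well below the one with the nominally lowest CV error.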
I appreciate any advice,
Manfred
Perhaps you could try penalized estimation (manual, page 11), which is recommended for large K values.
Thank you for this hint, Galina. It's definitely worth a try. Do you have any experience with that? I'm not sure which values I should use for lambda and epsilon. Maybe I have to try different values and pick the combination with the lowest CV error.
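The trial-and-error idea could look like the sketch below. It assumes ADMIXTURE has already been run once per (lambda, epsilon) pair at a fixed K and that the CV errors were collected by hand. The grid values and errors are invented placeholders, not recommendations.

```python
# Hypothetical results: (lambda, epsilon) -> CV error at a fixed K.
# All values below are placeholders for illustration only.
cv_by_setting = {
    (10.0, 0.01): 0.5104,
    (10.0, 0.10): 0.5102,
    (100.0, 0.01): 0.5097,
    (100.0, 0.10): 0.5099,
    (1000.0, 0.01): 0.5110,
}

# Pick the penalty setting with the lowest CV error.
best_setting = min(cv_by_setting, key=cv_by_setting.get)
best_lambda, best_epsilon = best_setting

print("lambda =", best_lambda, "epsilon =", best_epsilon,
      "CV error =", cv_by_setting[best_setting])
```

If the differences between settings are again only in the 3rd or 4th decimal place, the lowest value alone may not mean much, for the same reason as with K.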