I have a variant dataset for 95 plant genomes based on 30x coverage whole-genome resequencing. I use the --indep-pairwise command in PLINK 1.9 to filter variants based on LD within a window of 20,000 variants and run PCA's. My PCA plots show clear differentiation between groups of individuals so I want to assess population structure and admixture in ADMIXTURE.
Running ADMIXTURE 1.3 for various numbers of populations (k = 1:9) the populations recovered match those found in the PCA. However, which k has the lowest cross-validation error is very much dependent on the LD threshold used for filtering. Below is a plot showing cross-validation errors for k=1:9 based on three differently filtered datasets. As you can see, with the largest dataset based on an LD threshold of R^2 = 0.5 (n=114,953 variants), the optimal k = 1. The smaller dataset based on r^2 = 0.3 (n=14,574) has optimal k = 2. The smallest dataset based on R^2 < 0.1 (n=2,034) has optimal k = 4. Also, overall cross-validation errors go up with reducing LD thresholds.
I am not a population geneticist but I find this highly surprising. Why would including more variants in LD reduce both the cross-validation errors and the population structure? I would expect that linked variants would boost any population-level signal rather than reduce it. If anyone can explain these patterns to me I would be much obliged.
More practically, how can I determine the optimal filtering and k for my dataset?
Many thanks in advance.