Hi, I'm trying to run a DAPC analysis with the adegenet package. My dataset consists of 145 bacterial strains isolated from 6 parts of one country. I used Snippy to call SNPs, and 238565 SNPs were detected.
Then I ran DAPC using the following code.
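library(vcfR)      # read.vcfR(), vcfR2genind()
library(adegenet)  # find.clusters(), dapc()
x <- read.vcfR("aaa.vcf", verbose = FALSE)
y <- vcfR2genind(x, ploidy = 1)  # convert the vcfR object x (haploid data)
y@pop <- as.factor(z)  # z is a vector holding the population of each strain
grp <- find.clusters(y, max.n.clust = 10)
dapc1 <- dapc(y, grp$grp)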
However, the plot of variance explained by PCA was almost a straight line.
After filtering the VCF file (vcftools with --maf 0.10), which reduced 238565 SNPs to 43306 SNPs, the result was the same.
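To make clear which curve I mean, here is a minimal base-R sketch of how the cumulative variance explained can be computed from a 0/1 genotype matrix. It runs on simulated data, not my real SNPs. The function name cum_var_explained and the matrix geno are made up for illustration, and it uses prcomp() rather than the PCA that adegenet runs internally, so treat it as an approximation of the idea only.

```r
# Toy haploid genotype matrix: 40 strains x 500 SNPs (simulated, NOT my real data)
set.seed(1)
geno <- matrix(rbinom(40 * 500, size = 1, prob = 0.3), nrow = 40, ncol = 500)

# Hypothetical helper: cumulative proportion of variance explained by the PCs
cum_var_explained <- function(m) {
  pc <- prcomp(m, center = TRUE, scale. = FALSE)
  v <- pc$sdev^2        # variance carried by each PC
  cumsum(v) / sum(v)    # running total expressed as a proportion
}

cv <- cum_var_explained(geno)
print(round(head(cv, 5), 3))  # cumulative proportion for the first 5 PCs
```

Calling plot(cv, type = "b") on the result draws the kind of curve I am describing.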
Then I ran cross-validation to decide the number of PCs to retain and obtained the following scatterplot. The result of the cross-validation was:
$Median and Confidence Interval for Random Chance
     2.5%       50%     97.5%
0.2830688 0.3253968 0.4191270

$Mean Successful Assignment by Number of PCs of PCA
       10        20        30        40        50        60        70        80
1.0000000 1.0000000 0.9992063 0.9968254 0.9992063 0.9928571 0.9944444 0.9809524
       90       100       110       120
0.9857143 0.9547619 0.9000000 0.8023810

$Number of PCs Achieving Highest Mean Success
[1] "10"

$Root Mean Squared Error by Number of PCs of PCA
         10          20          30          40          50          60          70
0.000000000 0.000000000 0.004347004 0.008694009 0.004347004 0.013041013 0.014417383
         80          90         100         110         120
0.024590370 0.023002185 0.051617818 0.110143176 0.203660753

$Number of PCs Achieving Lowest MSE
[1] "20"
I decided to retain 20 PCs.
The inferred groups seemed to be plausible.
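For reference, the cross-validation step above can be sketched roughly as follows. This is not my exact call: it uses the nancycats data set shipped with adegenet as a stand-in (diploid microsatellites, not haploid SNPs), and the grid of PCs and the small n.rep are illustrative choices to keep it fast.

```r
library(adegenet)

# Stand-in data set shipped with adegenet (NOT my SNP data)
data(nancycats)
mat <- tab(nancycats, NA.method = "mean")  # allele matrix, missing values replaced by the mean
grp <- pop(nancycats)                      # group labels used for cross-validation

set.seed(1)
xval <- xvalDapc(mat, grp,
                 n.pca = seq(10, 60, by = 10),  # candidate numbers of PCs (illustrative)
                 training.set = 0.9,            # 90% training, 10% validation
                 result = "groupMean",
                 n.rep = 5,                     # few replicates to keep the sketch fast
                 xval.plot = FALSE)

# Same elements as in the output above
xval[["Number of PCs Achieving Highest Mean Success"]]
xval[["Number of PCs Achieving Lowest MSE"]]
```

With my data, mat and grp would be replaced by tab(y, NA.method = "mean") and the grouping factor being tested.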
My questions are:
・The plot of variance explained by PCA was almost a straight line, and each cluster was tightly packed. Does this depend on my data? Is there any way to improve it?
・How should I choose the group type (original groups or inferred groups) when I run DAPC? When should DAPC be run with the original groups?
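To show what I mean by comparing the two group types, here is a minimal base-R sketch that cross-tabulates original and inferred groups. The labels are invented for illustration; in my session the two vectors would come from pop(y) and grp$grp.

```r
# Made-up labels for 8 strains (hypothetical, NOT my real data)
original <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"))  # sampling location
inferred <- factor(c(1, 1, 2, 2, 3, 3, 3, 2))                  # cluster label, as find.clusters() would assign

# Rows: original groups, columns: inferred clusters
cross <- table(original = original, inferred = inferred)
print(cross)

# Proportion of strains that fall in the most common cluster of their original group
agreement <- sum(apply(cross, 1, max)) / sum(cross)
print(agreement)
```

In DAPC terms, the two candidate calls I am asking about would be dapc(y, pop(y)) for the original groups and dapc(y, grp$grp) for the inferred ones.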
Thanks,