Question

How to interpret these weird plots from find.clusters() function in adegenet package?

5.7 years ago

andre.floresb • 0

Hello,

I am trying to run a DAPC analysis on my genome-wide dataset, which includes 188736 genotypes for 188 individuals from 18 different geographic populations. I already know there is some genetic structure in the dataset; at least 2 groups can be defined. However, when I run the find.clusters() function to determine the most plausible number of groups for my dataset, I obtain strange plots of "Cumulative variance explained by PCA" and "Value of BIC vs number of clusters":

Cumulative_variance_explained_by_PCA

Value_of_BIC_vs_number_of_clusters

This is my R script:

library("adegenet")

snps <- read.PLINK(file = "file.raw", map.file = "file.map")

grp <- find.clusters(snps, max.n.clust = 36, n.iter=1000)

dapc1 <- dapc(snps, grp$grp)

scatter(dapc1)

When I perform the analysis with a small subset of my data, say 1000 SNPs, it seems to run and I obtain normal plots. However, a subset would not be representative of my dataset for this kind of analysis, so I was wondering whether the issue could be the size of the dataset.

Do you have any idea why I obtain these results and what they actually mean? Could it be that the number of genotypes and samples is so high that the function cannot handle them?

Thanks a lot in advance.

André

R adegenet find.clusters DAPC k-means • 2.6k views

updated 4.1 years ago by Kevin Blighe • written 5.7 years ago by andre.floresb

Hi André,

I am having the same problem with my data. Did you figure out whether something was wrong with your data or with the script?

Thank you.

Sandara Brasil.

4.1 years ago by sandara.brasil

score 1 · Answer 1 · 2020-03-14

This was also asked on ResearchGate and GitHub:

The first plot shows the cumulative variance explained as principal components (PCs) are added; it is closely related to what is commonly called a 'scree' plot. The second plot shows the BIC obtained for each candidate number of clusters tried by the k-means step of find.clusters(); lower BIC values indicate better-supported numbers of clusters.
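To make the two plots more concrete, here is a minimal base-R sketch of how such curves can be computed. It does not call adegenet: the genotype matrix is simulated (two hypothetical groups), the number of retained PCs (`n_pca`) is an arbitrary choice, and the BIC formula is my understanding of what find.clusters() computes from the k-means within-group sum of squares, so treat it as an assumption rather than the package's exact code.

```r
set.seed(1)

# Simulated genotypes (hypothetical data): 2 groups of 30 individuals, 200 SNPs
n_per_group <- 30
n_snps <- 200
freq_a <- runif(n_snps, 0.1, 0.9)  # allele frequencies in group A
freq_b <- runif(n_snps, 0.1, 0.9)  # allele frequencies in group B
geno <- rbind(
  matrix(rbinom(n_per_group * n_snps, 2, rep(freq_a, each = n_per_group)),
         nrow = n_per_group),
  matrix(rbinom(n_per_group * n_snps, 2, rep(freq_b, each = n_per_group)),
         nrow = n_per_group)
)

# Plot 1: cumulative proportion of variance explained by the PCs
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b", xlab = "Number of retained PCs",
     ylab = "Cumulative variance explained")

# Plot 2: BIC for each candidate number of clusters K (k-means on retained PCs)
n_pca <- 20                      # arbitrary choice for this sketch
scores <- pca$x[, seq_len(n_pca), drop = FALSE]
n <- nrow(scores)
k_values <- 1:8
bic <- sapply(k_values, function(k) {
  wss <- kmeans(scores, centers = k, nstart = 10, iter.max = 100)$tot.withinss
  n * log(wss / n) + log(n) * k  # assumed BIC formula, see lead-in
})
plot(k_values, bic, type = "b", xlab = "Number of clusters", ylab = "BIC")

best_k <- k_values[which.min(bic)]  # K with the lowest BIC
```

With real structure in the data, the BIC curve usually shows a clear minimum or an elbow; a curve without either suggests that the retained PCs carry little clustering signal.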

You are seeing these plots likely because you are using too many 'uninformative' SNPs (188736) - by 'uninformative', I mean that, as a group / combined, they provide no useful information about structure in your dataset. You need to reduce this number of SNPs to a number of SNPs that are potentially informative of structure. If you want tips on how to identify such SNPs, take a look at step 7, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

You have to be aware of both MAF (minor allele frequency) and LD (linkage disequilibrium). For example, low-MAF variants provide little useful information on structure precisely because they are rare. On the other hand, identifying and eliminating SNPs that are in LD with one another helps to increase the power to identify structure.
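As a rough illustration of both filters, here is a base-R sketch on a simulated 0/1/2 genotype matrix. The thresholds (MAF >= 0.05, r^2 > 0.5) and the greedy all-pairs pruning are assumptions chosen for the example; on a real genome-wide dataset you would normally do this with a dedicated tool that prunes within windows along each chromosome (PLINK's --indep-pairwise, for instance) before loading the data into R.

```r
set.seed(42)

# Simulated genotypes (hypothetical data): 50 individuals, 300 SNPs coded 0/1/2
n_ind <- 50
n_snps <- 300
alt_freq <- runif(n_snps, 0.005, 0.5)
alt_freq[1:2] <- 0.4                 # make the first two SNPs common
geno <- matrix(rbinom(n_ind * n_snps, 2, rep(alt_freq, each = n_ind)),
               nrow = n_ind)
geno[, 2] <- geno[, 1]               # SNP 2 is in perfect LD with SNP 1

# MAF filter: keep SNPs whose minor allele frequency is at least 5%
alt <- colMeans(geno) / 2            # alternate allele frequency per SNP
maf <- pmin(alt, 1 - alt)
keep_maf <- maf >= 0.05
geno_maf <- geno[, keep_maf, drop = FALSE]

# Naive LD pruning: drop a SNP if its r^2 with any earlier kept SNP exceeds 0.5
r2 <- cor(geno_maf)^2
keep_ld <- rep(TRUE, ncol(geno_maf))
for (j in seq_len(ncol(geno_maf))[-1]) {
  earlier <- which(keep_ld[seq_len(j - 1)])
  if (any(r2[earlier, j] > 0.5, na.rm = TRUE)) keep_ld[j] <- FALSE
}
geno_pruned <- geno_maf[, keep_ld, drop = FALSE]

dim(geno_pruned)                     # individuals x retained SNPs
```

The all-pairs correlation matrix is only feasible for a few thousand SNPs; with ~190k SNPs it would not fit in memory, which is another reason to use a windowed pruning tool.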

Also, be sure to go through the vignette for this package: A tutorial for Discriminant Analysis of Principal Components (DAPC) using adegenet 2.0.0.

Kevin