Question: PCA cannot separate different breeds
11 months ago by
micro32uvas10 wrote:

Hello everyone,

I have WGS data for 118 samples, of which 72 come from one country and represent the 11 breeds under study. The other 46 serve as outgroups/controls. After building an NJ tree from the called SNP data, I ran a PCA, which separates the 72 from the 46 (the 46 clearly cluster into 4 groups), but the 72 form only one scattered cluster. In the next step I ran another PCA on these 72 alone, which gave three clusters in total (1 breed, 1 partial hybrid breed, and the rest a scattered mush).

The MAF threshold was calculated as 1/(2n). These are the command lines used to produce the PCA:

plink --bfile out.all --keep keep --maf 0.00423 --make-bed --chr-set 29 --out out 
plink --bfile ./out.all --indep-pairwise 50 5 0.2 --chr-set 29 --out out 
plink --bfile ./out.all --extract out.prune.in --make-bed --chr-set 29
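
A note on the commands above: the pruning and extraction steps read ./out.all again rather than the filtered fileset written by the first command, and no --pca step is shown. The threshold 0.00423 also corresponds to 1/(2 × 118); if the keep file lists only the 72 samples (an assumption), 1/(2 × 72) ≈ 0.00694 would be the matching value. What follows is only a minimal sketch of how the steps might be chained so that each one reads the previous output. The function name run_pca, its arguments, and the .filtered / .pruned output names are hypothetical and not from the original post.

```shell
# Sketch only: filter -> LD prune -> PCA, each step reading the previous output.
# run_pca and the ".filtered"/".pruned" suffixes are hypothetical names.
run_pca() {
    src=$1    # input PLINK binary fileset prefix, e.g. out.all
    keep=$2   # file listing the samples to retain
    maf=$3    # minor allele frequency threshold, e.g. 1/(2n)
    out=$4    # prefix for all output files

    # 1. Keep the chosen samples and drop variants below the MAF threshold.
    plink --bfile "$src" --keep "$keep" --maf "$maf" --chr-set 29 \
          --make-bed --out "$out.filtered" || return 1

    # 2. LD-prune the filtered data; this writes "$out.filtered.prune.in".
    plink --bfile "$out.filtered" --indep-pairwise 50 5 0.2 --chr-set 29 \
          --out "$out.filtered" || return 1

    # 3. Run the PCA on the pruned SNP set only.
    plink --bfile "$out.filtered" --extract "$out.filtered.prune.in" \
          --chr-set 29 --pca --out "$out.pruned"
}
```

Called as `run_pca out.all keep 0.00423 out`, this would leave the eigenvectors and eigenvalues in out.pruned.eigenvec and out.pruned.eigenval.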

Any help is very much appreciated. Awaiting.

tree snp admix pca wgs • 431 views
modified 11 months ago by RamRS27k • written 11 months ago by micro32uvas10

How were these samples collected and processed before variant calling? What variant calling filters did you use? There could be alternative sources of variation that are taking over your first two principal components.

What percentage of variation do your PCs explain? If this number is quite low, it is worth trying other methodologies for composition (like tree building).

modified 11 months ago • written 11 months ago by Ace70
11 months ago by
Carambakaracho2.2k wrote:

Hi micro32uvas, IMO, you lack a specific question. Is the question why does the PCA not separate clusters? Or rather how can I improve variant calling? Why would you think this doesn't reflect your data in the first place?

FMPOV, all looks good, given the limited information you provide. You have a separation according to geographical origin, and one of the clusters has subclusters (as many as you suspect?). Maybe this suggests that your samples come from less isolated or less purebred lineages than you might think.

written 11 months ago by Carambakaracho2.2k

Thank you very much, Carambakaracho, for your response. The main question is about the PCA and the clustering it produces. The variant calling is fine, as I followed the 1000 Bull Genomes pipeline and it has worked well so far. The difficulty is that the 11 breeds cover an entire country with a very rich cultural history. By phenotype and productivity traits the breeds separate into at least 3 groups, but in the current results they form one random cluster.

If I wanted to zoom out, I would include some breeds from other countries. What I want is to zoom in and resolve this one big cluster. The purebred cattle cannot all be hybrids.

I have done PCA with exotic cattles as well as only my samples alone. But the single cluster is not breaking.

Help is greatly appreciated.

written 11 months ago by micro32uvas10

I have done PCA with exotic cattles as well as only my samples alone. But the single cluster is not breaking.

The data is what it is. The fact that you are getting clusters in your initial analysis suggests there is likely nothing wrong with the way you are doing it. I suggest you check how much variance is explained by your first 3 principal components. If their sum is below 30-40%, it is no surprise that you get diffuse clustering.
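
As a rough way to check this, the sketch below computes each component's share of the eigenvalue sum from a PLINK .eigenval file (one eigenvalue per line). The function name pct_variance is hypothetical. One caveat, stated as an assumption about your run: PLINK 1.9 writes only the top 20 components by default, so the denominator here is the sum of the eigenvalues present in the file, not necessarily the total variance, and the percentages will be overstated unless more components are requested through the count argument of --pca.

```shell
# pct_variance FILE
# Prints, for each eigenvalue in FILE (one per line): component label,
# percent of the eigenvalue sum, and cumulative percent, tab-separated.
# Hypothetical helper; percentages are relative to the eigenvalues in FILE.
pct_variance() {
    awk 'NF {
             n++                 # count non-empty lines
             ev[n] = $1          # store each eigenvalue
             total += $1         # running sum used as the denominator
         }
         END {
             for (i = 1; i <= n; i++) {
                 cum += ev[i]
                 printf "PC%d\t%.2f\t%.2f\n", i, 100 * ev[i] / total, 100 * cum / total
             }
         }' "$1"
}
```

For example, `pct_variance out.eigenval | head -n 3` would show the first three components and their cumulative share.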

While there is no magical way of extracting clusters from data if there aren't any, t-SNE often embeds data in a more visually appealing way than PCA. See also here and you may want to check out UMAP.

Because it applies here, and also to show you how to include your PCA image:

this quote

As a general rule, images are shared by uploading them to an external site and providing a link.

written 11 months ago by Mensur Dlakic5.8k

Thank you, Mensur Dlakic. And yes, the sum of the three components is ~9-10%. I am not familiar with t-SNE, but I will definitely look into it and post an update here if I see any difference.

written 11 months ago by micro32uvas10

But the single cluster is not breaking

because the members are very close to each other in PCA space (did you try using more than two components for clustering?), meaning that they can't be distinguished based on genetic variability as captured by your data. Without access to the data, it's difficult to give a detailed explanation. If I understand correctly, you expect 3 phenotypic groups of the 11 breeds to have shared evolutionary history. While this seems like a reasonable hypothesis, it would seem that your data doesn't support it. However, it's also possible that only a few SNPs are informative and their signal is lost in all the other irrelevant ones. For an example of selecting SNPs in a similar context, have a look at this paper.
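
The method of the linked paper is not reproduced here. As one possible illustration of picking out informative SNPs, the sketch below ranks SNPs by per-SNP FST between predefined groups and keeps the top N. It assumes a PLINK 1.9 .fst report with a header line and the columns CHR, SNP, POS, NMISS, FST; the function name top_fst_snps and all file names are hypothetical. Be aware that SNPs chosen using the breed labels will tend to separate those breeds in a later PCA by construction, so this shows which markers differ between groups rather than providing independent evidence that the groups exist.

```shell
# top_fst_snps FST_FILE N
# Prints the IDs of the N SNPs with the highest FST, one per line.
# Assumes columns CHR SNP POS NMISS FST with a header (hypothetical helper).
top_fst_snps() {
    fst_file=$1
    n=$2
    # Skip the header and undefined values, then sort by FST, highest first.
    awk 'NR > 1 && $5 != "nan" { print $5, $2 }' "$fst_file" |
        sort -k1,1gr |
        head -n "$n" |
        awk '{ print $2 }'
}

# Possible usage (not run here; groups.txt is a hypothetical cluster file
# with the columns FID, IID, group):
#   plink --bfile out.filtered --within groups.txt --fst --chr-set 29 --out out.groups
#   top_fst_snps out.groups.fst 5000 > top.snps
#   plink --bfile out.filtered --extract top.snps --chr-set 29 --pca --out out.top
```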

modified 11 months ago • written 11 months ago by Jean-Karim Heriche22k

Thank you, Jean-Karim Heriche, for the guidance. I used three components (PC1, PC2, PC3) to make three PCA plots, but the results were comparable. Following the paper and your suggestion, if I increase the LD threshold or change other parameters for SNP selection, could I still separate them? Could you explain a bit more? I can share the PCA plot, but the picture cannot be uploaded here.


written 11 months ago by micro32uvas10