Question: Can we run PCA on genotyping data combined with geographic data for two populations in different locations?
6 months ago by
XYZeinab0 wrote:

PCA is one of many methods for analysing population structure in population genetics. Genotype-environment correlation is also important. For example, suppose we have two populations (breeds) in two different locations, each with genotyping data and geographic data (e.g. elevation, latitude, longitude, temperature, rainfall, and so on). Why don't researchers combine the geographic data with the genotyping data when calculating the PCA? What is your opinion on this? Have you read any papers on this subject?

I would really appreciate any guidance.

snp chip-seq genome • 246 views
modified 6 months ago • written 6 months ago by XYZeinab0

Hello Kevin, thank you for your prompt reply, and sorry for my late answer. I found a paper closely related to my question, but on GWAS rather than PCA. In this paper, gene-by-environment interactions are represented as d parameters in the GWAS model. But why don't researchers consider them in their PCA? Can we consider environmental factors as eigenvector in PCA?

This paper: François, O., & Caye, K. (2018). Naturalgwas: An R package for evaluating genomewide association methods with empirical data. Molecular ecology resources, 18(4), 789-797.


written 6 months ago by XYZeinab0

Can we consider environmental factors as eigenvector in PCA?

Hi, I am not quite sure what you mean. Could you explain it in Spanish or Portuguese?

written 6 months ago by Kevin Blighe67k
6 months ago by
Kevin Blighe67k
Republic of Ireland
Kevin Blighe67k wrote:

In this context, PCA is usually performed to produce covariates that control for population stratification in later models, such as a regression of the outcome variable on each SNP. This matters because, without it, 'spurious' (false-positive) associations can arise:

glm(outcome ~ SNP + PC1 + PC2 + ... + PCn)

For example, consider the 1000 Genomes Phase III dataset:

[Figure: PCA plot of the 1000 Genomes Phase III dataset]

In this particular plot, if my memory serves me correctly, genotypes at ~12,000 positions are used. If we wanted to conduct a study of diabetes across the global population, we would have to control for this stratification of the population groups by including one or more principal components as covariates.
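
The covariate adjustment described above can be sketched on simulated data. This is only an illustration under stated assumptions, not the workflow used for the 1000 Genomes plot. It is written in Python with NumPy rather than R, and it fits an ordinary least-squares model on a quantitative trait instead of calling glm(). The sample sizes, allele frequencies, effect size and the helper name snp_effect are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation settings (not taken from the thread).
n_per_pop, n_snps = 200, 1000

# Two populations whose allele frequencies have drifted apart.
freq_a = rng.uniform(0.1, 0.9, size=n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, size=n_snps), 0.05, 0.95)

pop = np.repeat([0, 1], n_per_pop)             # population label per sample
freqs = np.where(pop[:, None] == 0, freq_a, freq_b)
geno = rng.binomial(2, freqs)                  # samples x SNPs, coded 0/1/2

# PCA via SVD of the column-centred genotype matrix.
centred = geno - geno.mean(axis=0)
u, s, _ = np.linalg.svd(centred, full_matrices=False)
pcs = u[:, :2] * s[:2]                         # sample scores on PC1 and PC2

# The trait depends on population only; no SNP has a true effect.
trait = 2.0 * pop + rng.normal(0.0, 1.0, size=pop.size)

# Test the SNP whose frequency differs most between the two populations.
snp = geno[:, np.argmax(np.abs(freq_b - freq_a))]


def snp_effect(covariates):
    """Least-squares slope of trait on snp, given extra covariate columns."""
    design = np.column_stack([np.ones(snp.size), snp] + covariates)
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return coef[1]


naive = snp_effect([])                         # trait ~ SNP
adjusted = snp_effect([pcs[:, 0], pcs[:, 1]])  # trait ~ SNP + PC1 + PC2
pc1_vs_pop = np.corrcoef(pcs[:, 0], pop)[0, 1]

print(f"SNP effect without PCs:    {naive:.3f}")
print(f"SNP effect with PC1 + PC2: {adjusted:.3f}")
print(f"correlation of PC1 with population label: {pc1_vs_pop:.3f}")
```

In this setup the unadjusted slope should be clearly non-zero even though the SNP has no real effect, because the SNP acts as a proxy for population membership. Once PC1 and PC2 are included, the slope should shrink toward zero, which is the false-positive control described above.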

