Question

Can we draw a PCA that combines genotyping data with geographic data for two different populations in different locations?

4.0 years ago

Emy ▴ 50

PCA is one of many methods for analysing population structure in population genetics research, and genotype-environment correlation is also important. For example, suppose we have two populations (breeds) in two different locations, each with genotyping data and geographic data (e.g. elevation, latitude, longitude, temperature, rainfall, and so on). Why don't researchers combine the geographic data with the genotyping data when calculating the PCA? What is your opinion on this? Have you read any papers on this subject?

I would really appreciate your guidance.

SNP ChIP-Seq genome • 918 views

3.9 years ago by Emy ▴ 50

Hello Kevin, thank you for your prompt reply, and sorry for my late response. I found a paper closely related to my question, but it concerns GWAS rather than PCA. In that paper, gene-by-environment interactions are represented as d parameters in the GWAS model. But why don't researchers consider them in their PCA? Can we consider environmental factors as eigenvectors in PCA?

This paper: François, O., & Caye, K. (2018). Naturalgwas: An R package for evaluating genomewide association methods with empirical data. Molecular Ecology Resources, 18(4), 789-797.

Sincerely

3.9 years ago by Emy ▴ 50

Can we consider environmental factors as eigenvector in PCA?

Hi, I am not too certain what you mean. Could you explain it in Spanish or Portuguese?

3.9 years ago by Kevin Blighe 87k

score 0 · Answer 1 · 2020-05-02

The purpose of performing PCA in this context is usually to produce covariates that control for population stratification in later models, such as a regression of the outcome variable on each SNP. This matters because uncorrected stratification can produce spurious (false-positive) associations:

glm(outcome ~ SNP + PC1 + PC2 + ... + PCn)

For example, consider the 1000 Genomes Phase III dataset:

In this particular plot, if my memory serves me correctly, genotypes at ~12000 positions are used. If we wanted to conduct a study exploring diabetes across the global population, we would have to control for the stratification between population groups by including one or more principal components as covariates.
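
As a rough illustration of the adjustment described above, here is a minimal Python sketch on simulated data (not the 1000 Genomes data). Everything in it is an assumption made for the example: the two toy populations, the sample and SNP counts, the helper name `snp_effect`, and the use of ordinary least squares as a stand-in for a Gaussian `glm`. The outcome is simulated to depend only on population membership, so any SNP effect in the unadjusted model is spurious.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two populations of 200 individuals, 500 SNPs.
n_per_pop, n_snps = 200, 500
pop = np.repeat([0, 1], n_per_pop)

# Allele frequencies differ between the two populations.
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
freqs = np.where(pop[:, None] == 0, freq_a, freq_b)
genotypes = rng.binomial(2, freqs).astype(float)  # 0/1/2 allele counts

# The outcome depends on population only, not on any SNP.
outcome = 1.0 * pop + rng.normal(0, 1, pop.size)

# PCA via SVD of the column-centered genotype matrix.
centered = genotypes - genotypes.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :2] * s[:2]  # scores on PC1 and PC2


def snp_effect(snp, covariates=None):
    """Least-squares SNP coefficient, optionally adjusted for covariates."""
    cols = [np.ones_like(snp), snp]
    if covariates is not None:
        cols.extend(covariates.T)
    design = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]


# Test the SNP whose allele frequency differs most between populations.
snp = genotypes[:, np.argmax(np.abs(freq_a - freq_b))]
naive = snp_effect(snp)           # outcome ~ SNP
adjusted = snp_effect(snp, pcs)   # outcome ~ SNP + PC1 + PC2

print("SNP effect without PCs:", naive)
print("SNP effect with PC1 + PC2:", adjusted)
```

In this toy setup PC1 tracks population membership closely, so including the PCs as covariates should shrink the spurious SNP coefficient towards zero. With real data, the number of PCs to include is a judgment call that the sketch does not address.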

Kevin