Question: QTL Analysis Using Principal Components as Covariates
John (United States) wrote, 7.0 years ago:

Hi all,

I am wondering why principal components (PCs) are included as covariates in QTL analysis, and how the number of PCs to include is determined. For example, the following is a short excerpt from a paper. I understand that including imputation status adjusts for potential imputation biases, but what do the PCs correct for? Thank you in advance!

The details of sample sets, data filtering and normalization are discussed above. Briefly, we did transcriptome QTL mapping separately for European (n=373) and Yoruba (n=89) populations. We used genetic variants with MAF>5% in either EUR or YRI <1MB from transcription start site, with covariates of imputation status (0|1), PCs 1-3 for Europeans and PCs 1-2 for Yoruba.

rna-seq eqtl • 4.3k views
modified 2.7 years ago by geek_y • written 7.0 years ago by John

Check out this paper: "Principal components analysis corrects for stratification in genome-wide association studies."

written 7.0 years ago by matted

Thanks for the paper!

written 7.0 years ago by John
David Quigley (San Francisco) wrote, 7.0 years ago:

It would help if you provided the citation, but most likely the authors are trying to minimize the effect of cryptic (i.e. unwanted and unplanned) genetic diversity as a confounder in their eQTL study. If you sample people from what you believe is a single population and test for association against their genotypes, you'd like the only things affecting the dependent variable to be the genotype and the other "official" covariates. However, a population-based sample can contain subgroups of subjects with systematic differences in their genetic structure.

Let's say you have patients from the North and from the South, and patients within a geographic group are more similar to each other genetically than they are to patients in the other group. Some of the time these differences will co-vary with your dependent variable, misleading you about the effect of a given genotype. One way this can happen is if the two populations have different minor allele frequencies at a given locus or set of loci, and within each population there is no association with the dependent variable. If the variable nevertheless differs between the cryptic populations, you might conclude that specific genotypes are associated with the variable when it is really population membership as a whole that is associated.

Another case is where you think all your patients share a single ethnic background (and therefore a genetic background with a given degree of similarity), but a minority of them carry a significant genetic contribution from some other ethnicity. Usually you'd like to remove those effects as best you can, so that you test only the effect of genotype on your dependent variable.
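To make the North/South scenario concrete, here is a minimal simulation sketch in Python (standard library only). Everything in it is an assumption made for illustration: the allele frequencies, the trait means, the sample size, and the helper names `genotype` and `slope`. By construction the genotype has no effect on the trait; only group membership does.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

def genotype(freq):
    """Draw a 0/1/2 allele count for a variant with allele frequency freq."""
    return (random.random() < freq) + (random.random() < freq)

def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

N_PER_GROUP = 2000
# Hypothetical groups: (allele frequency, mean of the trait).
# The trait depends on the group only, never on the genotype.
groups = {"North": (0.1, 0.0), "South": (0.6, 1.0)}

data = {}
for name, (freq, trait_mean) in groups.items():
    g = [genotype(freq) for _ in range(N_PER_GROUP)]
    y = [random.gauss(trait_mean, 1.0) for _ in range(N_PER_GROUP)]
    data[name] = (g, y)

# Pooled analysis that ignores the hidden groups
pooled_g = data["North"][0] + data["South"][0]
pooled_y = data["North"][1] + data["South"][1]
pooled_slope = slope(pooled_g, pooled_y)

# The same test run separately inside each group
within_slopes = {name: slope(g, y) for name, (g, y) in data.items()}

print("pooled slope:", round(pooled_slope, 3))
for name, s in within_slopes.items():
    print(name, "slope:", round(s, 3))
```

Under these assumptions the pooled slope should come out clearly positive even though no genotype effect was simulated, while the within-group slopes should stay close to zero. That gap is the spurious association produced by the hidden groups.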

The PCA in this case is an attempt to capture the largest sources of undesired variance in the genotype data; including the top PCs as covariates reduces the effect of cryptic diversity. You would probably determine the "correct" number of PCs to adjust for empirically. I don't know whether there is an established rule for this, or whether it's simply part of the practice of genetic epidemiology to look for PCs that appear to affect the analysis and adjust for them.
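As a rough sketch of what "adjusting for a PC" means mechanically, the Python example below (standard library only) simulates two hidden groups, computes the first genotype PC by power iteration, and compares the slope for one SNP with and without that PC as a covariate. This is not the pipeline from the paper. The simulated frequencies, the trait shift, the use of a single PC, and the helper names are all assumptions made for illustration, and in practice you would use a dedicated PCA tool rather than hand-rolled power iteration. The adjustment is done by residualizing both the trait and the SNP on PC1, which in ordinary linear regression gives the same SNP coefficient as fitting the SNP and PC1 together.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def genotype(freq):
    """Draw a 0/1/2 allele count for a variant with allele frequency freq."""
    return (random.random() < freq) + (random.random() < freq)

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    xc, yc = center(x), center(y)
    return sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)

def residualize(values, covariate):
    """Remove the linear effect of a single covariate from values."""
    vc, cc = center(values), center(covariate)
    b = sum(v * c for v, c in zip(vc, cc)) / sum(c * c for c in cc)
    return [v - b * c for v, c in zip(vc, cc)]

def first_pc(rows, iterations=10):
    """Per-sample scores on the top principal component (power iteration)."""
    n_snps = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_snps)]
    x = [[r[j] - means[j] for j in range(n_snps)] for r in rows]
    v = [random.gauss(0, 1) for _ in range(n_snps)]  # random starting loadings
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        v = [sum(x[i][j] * scores[i] for i in range(len(x))) for j in range(n_snps)]
        norm = sum(a * a for a in v) ** 0.5
        v = [a / norm for a in v]
    return [sum(a * b for a, b in zip(row, v)) for row in x]

N_PER_GROUP, N_SNPS = 500, 100
# Hypothetical (group 0, group 1) allele frequencies; SNP 0 is the one we test.
freqs = [(0.3, 0.6)]
for _ in range(N_SNPS - 1):
    p = random.uniform(0.35, 0.65)
    freqs.append((p, p + random.choice([-0.3, 0.3])))

genotypes, trait = [], []
for group in (0, 1):
    for _ in range(N_PER_GROUP):
        genotypes.append([genotype(f[group]) for f in freqs])
        # The trait depends on the hidden group only, not on any SNP.
        trait.append(2.0 * group + random.gauss(0, 1))

test_snp = [row[0] for row in genotypes]
pc1 = first_pc(genotypes)

naive_slope = slope(test_snp, trait)
adjusted_slope = slope(residualize(test_snp, pc1), residualize(trait, pc1))

print("slope without PC1:", round(naive_slope, 3))
print("slope with PC1 as covariate:", round(adjusted_slope, 3))
```

With this setup the unadjusted slope should be well away from zero, and the PC-adjusted slope should shrink toward zero, because PC1 acts as a proxy for the hidden group label. Re-running the comparison with one more PC at a time is one simple way to explore the number of PCs empirically.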

written 7.0 years ago by David Quigley

Thank you so much for the thorough answer! The paper I am referring to is "Transcriptome and genome sequencing uncovers functional variation in humans". I now understand the rationale behind using PCA in QTL analysis, but could regressing out PCs that are associated with the quantitative trait accidentally remove a true biological effect? What if the associated PC contains true signal?

written 7.0 years ago by John

That's a real possibility, but in a genome-wide study with 500,000+ SNPs, false positives are a major problem. Most groups would prefer to be conservative and greatly reduce bogus signal, at the risk of reducing the amount of variance correctly explained by their model.

written 7.0 years ago by David Quigley