Question: How to decide which Principal Components are the most correlated with with covariates for plink regression?
gravatar for anamaria
8 weeks ago by
anamaria100 wrote:


I am planning to run regression in plink via: --glm genotypic And I am wondering should I include all these covariates. So how would I find out which PCs are most strongly correlated with my covariates: td, array, HBA1C, age.... I should add that this is case-control study. The reasoning behind this is that if say PC4 turns out to be most strongly correlated with "pheno" case status (2/1) (not independant of it) then the adjustment for the PC4 will also remove some of the treatment effect.

and my covariate file looks like this:

FID IID pheno sex age PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 td array HBA1C
1000025 1000025 -9 Female 58 -13.0692 3.82546 -2.80895 -0.120865 -9.5369 2.25826 0.975514 -2.22046 1.15324 3.67727 -9 Biobank 29.8
 1000056 1000056 -9 Female 57 -13.0419 4.07604 -2.35979 2.32617 -0.73101 -1.56894 1.64281 -2.55201 -2.5772 0.111892 -9 Bileve NA
1000089 1000089 -9 Female 66 -9.557 4.27054 -0.91537 -0.160893 -2.27807 -1.08545 -0.551797 0.360637 0.30142 -1.28888 -9 Biobank 28.1
 1000093 1000093 1 Female 65 -11.4407 0.716765 -0.0932982 -1.22921 2.56851 -0.0708945 2.841 1.39155 -5.13552 -1.97563 2 Biobank 65.3
 1000115 1000115 -9 Male 51 -15.0062 6.97111 -2.93247 -3.03366 -3.73211 1.80292 0.71272 1.93733 -1.83419 -1.46217 -9 Biobank 30.7

I guess I can use something like this can get correlation coefficients:


Say that I have high correlation coefficient between PC4 and pheno, should I not include PC4 in my analysis?

Is it better to first say do "scree" plot in R and determine which PCs to keep and than include only those in my covariate file and do the above analysis via pairs.panels()?

Any thoughts would be highly appreciated!

ADD COMMENTlink modified 8 weeks ago by Kevin Blighe65k • written 8 weeks ago by anamaria100
gravatar for Kevin Blighe
8 weeks ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

Most use PCs for the purpose of adjusting for population stratification in the cohort, and, even in this case, there are no standard ways to decide on which PCs to include. I have come across researchers who just blindly include any number of PCs without even checking that they are accounting for population structure.

You seem to want to include PCs that are statistically significantly correlated to your covariates? You can feasibly perform PCA on your covariates and use the resulting PCs as surrogate un-correlated covariates in your case-control analysis. In this case, you could simply choose the number of PCs [to include] based on how many account for >80% total explained variation, or indeed regress each against your phenotype and include the ones that are statistically significantly associated with the phenotype.

Pairwise correlations could help, too.

HBA1C - glycohaemoglobin? I guess that you're studying diabetes or some liver disease.


ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by Kevin Blighe65k

HI Kevin,

Thank you so much for getting back to me.

So this is what I did:

I extracted just my covariates in data frame (without previously calculated PCs for my UKB data)

and it looks like this:

    > head(a)
       sex age td array HBA1C
    1:   1  58 -9     1  29.8
    2:   1  57 -9     2   0.0
    3:   1  66 -9     1  28.1
    4:   1  65  2     1  65.3
    5:   2  51 -9     1  30.7
    6:   2  64  2     1   0.0

Then I calculated PCs only for those covariates and determined the proportion of variance explained via:

scaled_df <- apply(a, 2, scale)
a.cov <- cov(scaled_df)
a.eigen <- eigen(a.cov)
PVE <- a.eigen$values / sum(a.eigen$values)

[1] 0.2750932 0.2031598 0.1936860 0.1867781 0.1412830

you can see that none of the covariates (sex age td array HBA1C) contributes largely to variance...Does that mean I keep all of them?

Also what about calculating Principal components for my genomic data ( I already got calculated PCs for that from UKB) how do I select there how many to keep? Is "scree" plot in R valuable estimation there?

And yes it is Diabetic Retinopathy :)

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by anamaria100

on another hand if I keep my whole covariate file with all 10 PCs and all 6 covariates inclusing "pheno" (makes it total to 16) and I calculate correlations between them, none of them is passing 0.7

a=fread("pheno_covar_modified.txt", header=T)
d_cor <- as.matrix(cor(a[,1:16]))
d_cor_melt <- arrange(melt(d_cor), -abs(value))
thresold = 0.6
b=filter(d_cor_melt, abs(value) > thresold  & value !=1)

    > b
   Var1 Var2     value
 1  PC5  PC4 0.6472917
 2  PC4  PC5 0.6472917

So what would mean I keep all coavriates and all PCs?

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by anamaria100

Hi again, I am not sure that there is much justification for calculating PC covariates for these (sex, age, td, array, and HBA1C). For example, some of these should be categorical variables, but PCA will regard them as continuous variables. I thought that you had other types of variables in your dataset.

I would just literally determine which of these are potential confounders via independent models, and then only include those that are statistically significant:

glm(phenotype ~ sex, ...)
glm(phenotype ~ age, ...)
*et cetera*
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Kevin Blighe65k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1188 users visited in the last hour