Question: Plink2 error: Skipping --glm regression on phenotype 'PHENO1' since variance inflation factor for covariate 'COVAR1' is too high.
18 months ago by
landscape95170
landscape95170 wrote:

I am estimating SNP effects with plink2. The phenotype comes from an external file rather than the .fam file, and the same file (other columns) serves as the covariate file. When I run the first command below, with covariates in columns 7 to 11, plink2 prints the message below and stops, yet the other two commands (column 11 alone, and columns 7-10) run. Is there a problem with this combination of covariates?

Start time: Mon Jan 21 19:25:22 2019
257682 MB RAM detected; reserving 128841 MB for main workspace.
Using up to 4 compute threads.
284516 samples (0 females, 0 males, 284516 ambiguous; 284516 founders) loaded
from ukbb_dis.fam.
1133273 variants loaded from ukbb_dis.bim.
1 quantitative phenotype loaded (284516 values).
3 covariates loaded from phenotype_dis.pheno.
Warning: Skipping --glm regression on phenotype 'PHENO1' since variance
inflation factor for covariate 'COVAR1' is too high. You may want to remove
redundant covariates and try again.
End time: Mon Jan 21 19:25:23 2019

First command, which produced the above warning:

./plink2 --bfile ukbb_dis --pheno phenotype_dis.pheno --pheno-col-nums 6 --covar phenotype_dis.pheno --covar-number 7-11 --input-missing-phenotype -10000000 --linear --adjust --out assoc_SNP_height_linear_plink2 --threads 4

The other two commands, which work fine:

./plink2 --bfile ukbb_dis --pheno phenotype_dis.pheno --pheno-col-nums 6 --covar phenotype_dis.pheno --covar-number 7-10 --input-missing-phenotype -10000000 --linear --adjust --out assoc_SNP_height_linear_plink2 --threads 4

Or

./plink2 --bfile ukbb_dis --pheno phenotype_dis.pheno --pheno-col-nums 6 --covar phenotype_dis.pheno --covar-number 11 --input-missing-phenotype -10000000 --linear --adjust --out assoc_SNP_height_linear_plink2 --threads 4

These last two commands run and produce the output below, although they would take a really long time:

Start time: Mon Jan 21 19:46:53 2019
257682 MB RAM detected; reserving 128841 MB for main workspace.
Using up to 4 compute threads.
284516 samples (0 females, 0 males, 284516 ambiguous; 284516 founders) loaded
from ukbb_dis.fam.
1133273 variants loaded from ukbb_dis.bim.
1 quantitative phenotype loaded (284516 values).
4 covariates loaded from phenotype_dis.pheno.
--glm linear regression on phenotype 'PHENO1': 0%

Here are two lines of the phenotype_dis.pheno file. The first six columns are as in the .fam file; columns 7 to 11 are batch, centre, year of birth, sex, and age. The file has no header.

3319618 3319618 0   0   0   173 2000    11011   1959    1   49
4567961 4567961 0   0   0   181 19  11018   1949    1   60

Your help is really appreciated!

plink gwas • 1.6k views
18 months ago by
chrchang5237.1k
United States
chrchang5237.1k wrote:

There are several issues here.

  1. "--covar-number 7-11" actually refers to columns 9-13 in the file, since the first two columns are ID.
  2. Batch and centre are categorical covariates. If you want plink2 to interpret them properly, you need to give them names that start with a letter, e.g. batch2000, batch19. (Warning: handling of categorical covariates hasn't been optimized yet.) If you don't want to do that, you're better off just leaving out those covariates.
  3. There is often a variance inflation factor problem with year of birth. A simple workaround is to transform all covariates to have mean zero, variance 1 by adding --covar-variance-standardize.
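
As a rough illustration of point 2 (a sketch, not anything plink2 ships), the Python below rewrites headerless rows like the two shown in the question into a table with a header line and letter-prefixed categorical values. The column names, the `#FID`/`IID` header, and the helper name `to_covar_rows` are assumptions made for this example; check them against the plink2 documentation for your build.

```python
# Hypothetical helper. Column layout follows the question:
# FID, IID, three unused .fam columns, height, batch, centre, yob, sex, age.
HEADER = ["#FID", "IID", "height", "batch", "centre", "yob", "sex", "age"]

def to_covar_rows(lines):
    """Turn whitespace-separated, headerless rows into a header plus string rows."""
    rows = [HEADER]
    for line in lines:
        fields = line.split()
        if len(fields) != 11:
            raise ValueError(f"expected 11 columns, got {len(fields)}")
        fid, iid, height = fields[0], fields[1], fields[5]
        batch, centre, yob, sex, age = fields[6:11]
        # Prefix categorical codes with letters so they are not read as numbers.
        rows.append([fid, iid, height, "batch" + batch, "centre" + centre,
                     yob, sex, age])
    return rows

# The two example rows from the question.
sample = [
    "3319618 3319618 0   0   0   173 2000    11011   1959    1   49",
    "4567961 4567961 0   0   0   181 19  11018   1949    1   60",
]
table = to_covar_rows(sample)
for row in table:
    print("\t".join(row))
```

With named columns, covariates can then be selected by name rather than by position, which sidesteps the column-offset confusion in point 1.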

Thank you very much for your help, Christopher!

(reply written 18 months ago by landscape95)

Hi Chang, I've done what you suggested, but I still get the warning shown below. Could you help me with this? Thank you very much!

Here is my code:

/home/joe/Tools/plink2/plink2 --bfile ukbb_dis_relationship --pheno Covariate_rel.tsv --pheno-name height --covar Covariate_rel.tsv --covar-name yob, sex, age --input-missing-phenotype -10000000 --linear --adjust --out rel_assoc_SNP_height_linear --covar-variance-standardize

And here is the output:

Start time: Sun Jan 27 15:52:03 2019
257682 MB RAM detected; reserving 128841 MB for main workspace.
Using up to 32 threads (change this with --threads).
73275 samples (0 females, 0 males, 73275 ambiguous; 73275 founders) loaded from
ukbb_dis_relationship.fam.
1133273 variants loaded from ukbb_dis_relationship.bim.
1 quantitative phenotype loaded (72978 values).
5 covariates loaded from Covariate_rel.tsv.
--covar-variance-standardize: 5 covariates transformed.
Warning: Skipping --glm regression on phenotype 'height' since variance
inflation factor for covariate 'yob' is too high. You may want to remove
redundant covariates and try again.
End time: Sun Jan 27 15:52:03 2019
(reply written 18 months ago by landscape95)

Oh, I should have also noted that year-of-birth and age are almost totally redundant; you need to remove one of those covariates.
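
To see why standardizing does not help here, a small sketch using the textbook definition of the variance inflation factor may be useful. With exactly two covariates, VIF reduces to 1 / (1 - r^2), where r is their Pearson correlation, and r is unchanged by rescaling. The values below are hypothetical toy data, and plink2's internal check may differ in detail.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_covariates(x, y):
    """VIF when the model has exactly two covariates: 1 / (1 - r^2)."""
    r2 = pearson_r(x, y) ** 2
    return math.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

def standardize(x):
    """Rescale to mean 0 and variance 1 (population variance)."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

# Hypothetical toy values: year of birth and age at assessment.
yob = [1959, 1949, 1962, 1945, 1970]
age = [49, 60, 46, 63, 39]

vif_raw = vif_two_covariates(yob, age)
vif_std = vif_two_covariates(standardize(yob), standardize(age))
print(f"VIF raw: {vif_raw:.1f}, VIF after standardizing: {vif_std:.1f}")
```

Both numbers come out the same and very large, because age is close to a constant minus year of birth. Only dropping one of the two covariates removes the redundancy.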

(reply written 18 months ago by chrchang523)

Hi Chang, my centre variable has fewer unique values than my batch variable, yet batch causes no problem while centre does. I still have no clue why the variance inflation factor is reported as too high for centre. Is there any way to work this out by hand or debug it manually? Thank you.

Warning: Skipping --glm regression on phenotype 'height' since variance
inflation factor for covariate 'centre=centre11009' is too high. You may want
to remove redundant covariates and try again.


> unique(a$batch)
 [1] "batch14"   "batch2000" "batch21"   "batch13"   "batch1"    "batch10"  
 [7] "batch15"   "batch-4"   "batch-11"  "batch-8"   "batch-3"   "batch17"  
[13] "batch-1"   "batch19"   "batch22"   "batch11"   "batch3"    "batch18"  
[19] "batch-7"   "batch2"    "batch12"   "batch9"    "batch16"   "batch6"   
[25] "batch-5"   "batch4"    "batch7"    "batch-6"   "batch20"   "batch-10" 
[31] "batch5"    "batch-9"   "batch8"    "batch-2"   "batchNA"  
> unique(a$centre)
 [1] "centre11009" "centre11006" "centre11021" "centre11007" "centre11014"
 [6] "centre11017" "centre11003" "centre11002" "centre11012" "centre11010"
[11] "centre11013" "centre11008" "centre11005" "centre11020" "centre11018"
[16] "centre11016" "centre11001" "centre11011" "centre11004" "centre11022"
[21] "centre10003" "centre11023"
(reply written 18 months ago by landscape95)
8 months ago by
FractalDumpster0 wrote:

I came across your question while Googling the same problem myself. Since I eventually figured it out, I decided to come back here and post an answer for anyone else who happens to Google across this:

Given the coding of your variable, I am guessing this is the "center" variable from UK Biobank: https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=10

The UK Biobank has some centers with relatively small sample sizes. Cross-tabulate center by your phenotype, and make sure no center provides fewer than 1% of your cases or controls or has fewer than about 100 people per cell. In my case, one center was providing only 27 of my approximately 16,000 cases, and Plink didn't like that. After grouping the two smallest centers (Swansea and Wrexham) together into a category called "centerOTHER", I was able to get Plink to run normally.
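
The cross-tab and grouping described above can be sketched in Python roughly as follows. The function names, the `min_count` default of 100 (taken from the rule of thumb above), and the toy data are all assumptions for illustration.

```python
from collections import Counter

def crosstab(centres, phenos):
    """Count samples in each (centre, phenotype) cell."""
    return Counter(zip(centres, phenos))

def merge_small_centres(centres, min_count=100, other_label="centerOTHER"):
    """Relabel every centre with fewer than min_count samples as other_label."""
    counts = Counter(centres)
    small = {c for c, n in counts.items() if n < min_count}
    return [other_label if c in small else c for c in centres]

# Hypothetical toy data: two large centres and two small ones.
centres = (["center11001"] * 300 + ["center11002"] * 250
           + ["center11022"] * 60 + ["center11023"] * 70)
phenos = ["case" if i % 10 == 0 else "control" for i in range(len(centres))]

# Inspect the cell counts before deciding what to merge.
for (centre, pheno), n in sorted(crosstab(centres, phenos).items()):
    print(centre, pheno, n)

merged = merge_small_centres(centres)
print(Counter(merged))
```

Check the cell counts again after merging: the combined category can itself still be too small, and the thresholds here are heuristics rather than anything plink2 enforces.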

I also have two alternative workarounds that I do not recommend (my first solution in the paragraph above is better), but which may prove useful to someone, so I'll include them:

  • It worked if I dummy-coded center manually using model.matrix in R to create my .covar file for Plink (e.g. 0/1 variables for center11001, center11002, etc.) and then let the --covar-variance-standardize flag do its thing on them. When using this kind of dummy-coding, remember that if you have n centers, you should have n-1 center variables (the one you leave out will be your reference group, e.g. if I leave out center11001 then people who are 0 on all center variables are from center11001). I do NOT recommend this since I am not sure what statistical side-effects it may have to let --covar-variance-standardize do its thing on each separate dummy variable.
  • You could also use Plink's --vif flag to set a very high threshold so it will still do the regressions even if it detects a problem. This solution is also not recommended unless you know enough about statistics to know when you should/shouldn't do it and are sure it's OK for your specific situation.
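
For readers not using R, the n-1 dummy coding from the first bullet can be sketched in plain Python. This is a minimal stand-in for what model.matrix does in the simplest case, not a reimplementation of it; the function name `dummy_code` and the toy labels are hypothetical.

```python
def dummy_code(values, reference=None):
    """Return (column_names, rows) of 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # default: first level in sorted order
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found")
    kept = [lv for lv in levels if lv != reference]
    # One row per sample; the reference level is all zeros.
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows

# Hypothetical toy labels: three centres, so two indicator columns.
centres = ["center11001", "center11002", "center11003", "center11001"]
names, rows = dummy_code(centres, reference="center11001")
print(names)
for row in rows:
    print(row)
```

Samples from the reference center get a row of zeros, which is why n centers need only n-1 columns. Including all n indicators alongside an intercept would make the covariates perfectly collinear.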