PLINK logistic regression analysis and covariates
1
2
Entering edit mode
23 months ago
lovegenetics ▴ 40

Hi everyone,

I'm struggling with PLINK and the results being generated by PLINK. I'm not sure which one to look at and consider any statistical significance.

I'm using a SNP data from a control/case study. I firstly did the basic association analysis using the following command:

plink --file xxx --assoc --ci 0.95 --out newfile


which generated a file with p values for each SNPs including an OR and L/U95 etc.

I then wanted to use the logistic regression command, which I have a covariate file that consists of 3 columns (FID, IID, and a covariate which i will call (CN), so I wanted to see if the covariate has an effect on phenotype or the outcome:

plink --file xx --logistic --covar covariate.txt --covarname CN --ci 0.95 --out newfile


which gave me the following results:

CHR SNP    BP          A1 TEST NMISS OR         STAT      P
8       SNP1 6962046 G   ADD   1058    0.646      -3.607     0.00031
8       SNP1 6962046 G   CN      1058    0.9289   -1.9         0.05737


I'd really appreciate it if someone explained to me what the results mean? also, the SNP1 from the assoc analysis showed with a P-value of (2.95E-06), I don't know if that would help.

I just want know what kind of command to use to generate an output knowing the effect of my covariate to the phenotype and how to interpret it. I have also used the --genotypic command in the logistic regression analysis but it has given me different p values and OR. which test do I need to use?

Using the --genotypic flag with --logistic gererated an output that looks like this:

CHR SNP   BP          A1  TEST           NMISS  OR       SE           L95       U95      STAT     P
8   SNP1 6962046 G   ADD             1058     0.6935  0.1951    0.4731  1.016    -1.876    0.06063
8   SNP1 6962046 G   DOMDEV     1058     0.8997  0.2302    0.573    1.413    -0.4593  0.646
8   SNP1 6962046 G   CN                1058     0.93      0.03887  0.8618  1.004    -1.868   0.06183
8   SNP1 6962046 G   GENO_2DF  1058     NA       NA           NA        NA        13.32    0.001284

PLINK SNP ASSOCIATION • 4.5k views
0
Entering edit mode

The file has a header line. Why are you ignoring it?

Every time you run plink, it prints a URL pointing to online documentation. Why are you ignoring that, too?

1
Entering edit mode

Thank you for pointing that out @chrchang523. I appreciate your time and effort in helping me out :). But to answer your questions, I have definitely not ignored any of the things you mentioned because they are all part of the analysis and understanding. But I find PLINK difficult to comprehend 100%, after six months of on and off with PLINK, i thought I'd try and seek help. I don't think this platform should be used to put people down. We are not trying to find an easy way out, we are trying to understand.

0
Entering edit mode

I stand by the harsh tone of my comment, because you did not include the header line in your question. There is simply no excuse for that when you are asking what the columns mean.

(edit: ok, I see that you have edited in the column headers. Now we can get somewhere.)

2
Entering edit mode
22 months ago

Hey, in the first example that you give:

CHR     SNP  BP      A1  TEST  NMISS OR     STAT    P
8       SNP1 6962046 G   ADD   1058  0.646  -3.607  0.00031
8       SNP1 6962046 G   CN    1058  0.9289 -1.9    0.05737


Here, the p-value for the SNP, after controlling for the covariate, 'CN', is given on the first row (ADD). On the second row is just the p-value for the covariate against the phenotype under study, and this is technically controlled for by the SNP.

The model is of form:

PHENO ~ SNP + CN


## ----------------------

In your second, '--genotypic', example, one has to understand that there are different ways of conducting analyses using genetic variant data:

• additive models (ADD) - most basic and are simply based on allelic tallies / dosage, with individuals having 0, 1, or 2 disease alleles. Usually, it is the minor allele that is being counted and tested (i.e., 0, 1, or 2 minor alleles)
• genotypic models (GENO) - more based on the fact that we can have 3 genotypes at any position: AA | AB | BB. Further, one can add in extra assumptions about dominance (AA) and recessiveness (aa) to these models.

If you are unsure about the second part with --genotypic, then just use the more simple case as in your first example.

Kevin

0
Entering edit mode

HI Kevin,

in his example:

CHR     SNP  BP      A1  TEST  NMISS OR     STAT    P
8       SNP1 6962046 G   ADD   1058  0.646  -3.607  0.00031
8       SNP1 6962046 G   CN    1058  0.9289 -1.9    0.05737


how would you present the effect of his covariate CN on logistic analysis for all SNPs?

0
Entering edit mode

If you want the p-values for all SNPs, controlling / adjusting for CN, then extract all ADD lines. Is this what you want?

0
Entering edit mode

no I would like to see the difference in regression analysis when using say "CN" covariate (with or without CN covariate). Should I run one regression analysis using all covariates including "CN" and one regression without "CN". If yes how do I visualize the difference? Just scatter plot of ADD p values for with and without "CN" covariate?

1
Entering edit mode

Yes, it should be okay to do the analysis with CN and without CN, and take the p-values for ADD in both cases.

In my example (above), however, we are justified to include the covariate due to the fact that its p-value against the end-point (trait) is 0.05737 (well, not statistically significant at 5% alpha, but, anyway).

We would usually compare the effect sizes via the beta coefficients. I think that you need to add --beta to your command in order to return these (PLINK < v1.9), or --logistic beta (PLINK >= v1.9).

0
Entering edit mode

This is actually my command I my question was related to the covariate in my code called "TD"

plink2 --threads 24 --pfile temporary1 --recover-var-ids


NVCFchr1.vcf.gz 'force' --pheno pheno_question_hba1c_all.txt --pheno-name pheno --covar pheno_question_hba1c_all.txt --covar-name sex,age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,TD,array,HBA1C --glm genotypic cols=+beta,+err --out allquestionBiobankFINchr1

The output for one SNP looks like this:

#CHROM  POS ID  REF ALT A1  TEST    OBS_CT  BETA    SE  Z_OR_F_STAT P   ERRCODE
10  135434303   rs11101905  G   A   A   ADD 11863   -0.110733   0.0986981   -1.12193    0.261891    .
10  135434303   rs11101905  G   A   A   DOMDEV  11863   0.079797    0.111004    0.718868    0.472222    .
10  135434303   rs11101905  G   A   A   sex=Female  11863   -0.120404   0.0536069   -2.24605    0.0247006   .
10  135434303   rs11101905  G   A   A   age 11863   0.00524501  0.00391528  1.33963 0.180367    .
10  135434303   rs11101905  G   A   A   PC1 11863   -0.0191779  0.0166868   -1.14928    0.25044 .
10  135434303   rs11101905  G   A   A   PC2 11863   -0.0269939  0.0173086   -1.55957    0.118863    .
10  135434303   rs11101905  G   A   A   PC3 11863   0.0115207   0.0168076   0.685448    0.493061    .
10  135434303   rs11101905  G   A   A   PC4 11863   9.57832e-05 0.0124607   0.0076868   0.993867    .
10  135434303   rs11101905  G   A   A   PC5 11863   -0.00191047 0.00543937  -0.35123    0.725416    .
10  135434303   rs11101905  G   A   A   PC6 11863   -0.0103309  0.0159879   -0.646172   0.518168    .
10  135434303   rs11101905  G   A   A   PC7 11863   0.00790997  0.0144025   0.549207    0.582863    .
10  135434303   rs11101905  G   A   A   PC8 11863   -0.00205639 0.0142709   -0.144096   0.885424    .
10  135434303   rs11101905  G   A   A   PC9 11863   -0.00873771 0.0057239   -1.52653    0.126878    .
10  135434303   rs11101905  G   A   A   PC10    11863   0.0116197   0.0123826   0.938388    0.348045    .
10  135434303   rs11101905  G   A   A   TD  11863   -0.670026   0.0962216   -6.96337    3.32228e-12 .
10  135434303   rs11101905  G   A   A   array=Biobank   11863   0.160666    0.073631    2.18205 0.0291062   .
10  135434303   rs11101905  G   A   A   HBA1C   11863   0.0265933   0.00168758  15.7583 6.0236e-56  .
10  135434303   rs11101905  G   A   A   GENO_2DF    11863   NA  NA  0.726514    0.483613    .


as you can see the p value for TD is quite small. So with this output, which is regression with TD. Do you still recommend running another regression without TD as covariate and comparing p ADD value in both runs or? How would you compare this using BETAs?

0
Entering edit mode

Hi Kevin,

can you please tell me if you have any further thoughts on this?

0
Entering edit mode

Hi Kevin,

I did run GWAS with and without TD covariate and p values appear to be highly correlated:

data: tt$P_TD and tt$P_noTD t = 20.17, df = 283, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.7156134 0.8117108 sample estimates: cor 0.7679493

So can I conclude that stratification by TD is not making a huge difference in the results? Also in my cohort about 80% of cases had Type 2 diabetes while about 8% has Type 1. (TD is reference for the type of diabetes)

Is there is any other way that I can check the stratification by the TD other than the above?