Hello,
I am using GCTA COJO to perform conditional analysis on summary statistics from a lung cancer GWAS. Specifically, I am trying to condition the P-values of non-coding SNPs on coding SNPs. However, I am getting an unusually high number of NAs in place of my corrected betas, standard errors, and P-values.
Because there are so many coding SNPs, I first select all conditionally significant coding SNPs, then condition all non-coding SNPs on these.
My commands look like this:
gcta64 --bfile [my file for LD calculations] \
--cojo-file [the summary statistics from the GWAS] \
--extract [a list of coding variants] \
--cojo-slct \
--cojo-p 5e-4 \
--out [conditioned coding SNP values]
I then use awk to pull out the conditionally significant coding SNPs.
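For reference, the awk step can be sketched roughly as follows. This is a minimal sketch, not my exact command: the function name and the file names in the comments are placeholders, and it assumes the --cojo-slct result is the whitespace-delimited .jma.cojo file with a header row and the SNP ID in the second column.

```shell
# Hypothetical helper: write the SNP IDs (column 2) of a --cojo-slct result
# to a one-ID-per-line list that can be passed to --cojo-cond.
# Assumes a whitespace-delimited file whose header has "SNP" as its second field.
extract_cond_snps() {
    jma_file=$1   # e.g. coding_slct.jma.cojo (placeholder name)
    out_list=$2   # e.g. coding_cond.snplist (placeholder name)
    awk '
        # Check the header once, then skip it.
        NR == 1 {
            if ($2 != "SNP") {
                print "unexpected header: " $0 > "/dev/stderr"
                exit 1
            }
            next
        }
        # Print the SNP ID of every non-empty data row.
        NF > 0 { print $2 }
    ' "$jma_file" > "$out_list"
}
```

Usage would be something like `extract_cond_snps coding_slct.jma.cojo coding_cond.snplist`, with the resulting list given to --cojo-cond.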
gcta64 --bfile [same as above] \
--cojo-file [same as above] \
--cojo-cond [the conditionally significant coding SNPs from awk] \
--out [my final summary statistics]
Taking chr8 as an example, many of my results look like this:
Chr  SNP             bp      refA  freq        b             se          p            n       freq_geno   bC            bC_se        pC
8    8:156716:G:C    156716  C     0.00241033  0.00206278   0.00061818   0.000847384  300437  0.00396991  0.00206278    0.00061819   0.000847478
8    8:156747:C:G    156747  G     0.00221684  0.00223315   0.000640693  0.000491219  304047  0.00396991  0.00223315    0.000640705  0.000491294
8    8:157714:C:G    157714  G     0.00315905  -0.000603486 0.000521356  0.247056     322534  0.00480568  -0.000603486  0.000521356  0.247056
But many more look like this:
Chr  SNP           bp      refA  freq       b             se           p         n       freq_geno  bC  bC_se  pC
8    8:156244:T:C  156244  C     0.036727   1.05558e-05   0.000164043  0.948693  289987  0.0385499  NA  NA     NA
8    8:156288:G:C  156288  C     0.253299   -8.04971e-07  6.76117e-05  0.990501  319304  0.25491    NA  NA     NA
8    8:156294:C:A  156294  A     0.0519816  -9.96523e-05  0.000136206  0.464395  301973  0.052967   NA  NA     NA
Overall, for this chromosome, only ~140,000 of the ~655,000 SNPs I input got actual conditioned results, as opposed to NAs.
I have run this analysis on different GWAS results (in the same format) and not had this problem. Does anyone know what might be going on?
Thank you very much!
Edit 1
I know that, as discussed here, if a SNP is completely predictable from a linear combination of SNPs fixed in the model, its P-value will be NA. However, I do not believe this is the explanation, for two reasons.
- Because I am working with summary statistics for the GWAS, the collinearity is determined from the file of genotypes I use to calculate LD. I have used the same genotypes for conditioning variants in other GWAS without getting nearly as many NAs in my results.
- Admittedly, the coding variants I condition on differ between GWAS. However, for my example of chr8, I am only conditioning on one coding SNP, so this explanation would require ~500,000 variants spread across the chromosome to be in perfect LD with the one SNP I am conditioning on.
Edit 2
I have noticed something strange, but am not sure what to make of it. Of variants on chr8 with a MAF below 0.0104, 75% have a corrected P-value. Of variants with a MAF above this cutoff, every single one returns an NA for the P-value.
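A tabulation like the one above can be reproduced with a short awk pass over the conditional output. This is a rough sketch under a few assumptions: the function name is made up, MAF is taken as min(freq, 1 - freq) from the `freq` column (not `freq_geno`), and the input has the same header as the result tables shown earlier (including `freq` and `pC`).

```shell
# Hypothetical helper: count SNPs and NA conditional P-values below and
# above a MAF cutoff in a --cojo-cond result file.
# Assumes a whitespace-delimited file with header fields "freq" and "pC".
na_by_maf() {
    cma_file=$1            # conditional results (placeholder argument)
    cutoff=${2:-0.0104}    # MAF cutoff; defaults to the value noted above
    awk -v cutoff="$cutoff" '
        # Map header names to column numbers so column order does not matter.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        NF > 0 {
            f = $(col["freq"]) + 0
            maf = (f > 0.5) ? 1 - f : f          # fold frequency to the minor allele
            bin = (maf < cutoff) ? "below" : "above"
            total[bin]++
            if ($(col["pC"]) == "NA") na[bin]++
        }
        END {
            printf "bin\ttotal\tNA\n"
            printf "below\t%d\t%d\n", total["below"], na["below"]
            printf "above\t%d\t%d\n", total["above"], na["above"]
        }
    ' "$cma_file"
}
```

Calling `na_by_maf results.cma.cojo 0.0104` (file name is a placeholder) prints one row per bin, which makes it easy to check whether the NA pattern really tracks allele frequency.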
Is 8:156244:T:C in your list supplied to --cojo-cond?
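One quick way to check this is an exact, whole-line match against the list file. The sketch below assumes the list has one SNP ID per line; the function name and the file name in the usage note are placeholders.

```shell
# Hypothetical helper: succeed (exit status 0) if the SNP ID appears as a
# complete line in the list file, fail otherwise.
in_cond_list() {
    snp_id=$1      # e.g. 8:156244:T:C
    list_file=$2   # list passed to --cojo-cond (placeholder argument)
    # -F: fixed string, -x: match the whole line, -q: quiet (status only)
    grep -Fxq -- "$snp_id" "$list_file"
}
```

For example, `in_cond_list 8:156244:T:C coding_cond.snplist && echo present || echo absent`.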