I am using GCTA COJO to perform conditional analysis on summary statistics from a lung cancer GWAS. Specifically, I am trying to condition the association statistics of non-coding SNPs on coding SNPs. However, I am getting an unusually high number of NAs in place of my conditioned betas, standard errors, and P-values (the bC, bC_se, and pC columns).
Because there are so many coding SNPs, I first select all conditionally significant coding SNPs, then condition all non-coding SNPs on these.
My commands look like this.
gcta64 --bfile [my file for LD calculations] \
    --cojo-file [the summary statistics from the GWAS] \
    --extract [a list of coding variants] \
    --cojo-slct \
    --cojo-p 5e-4 \
    --out [conditioned coding SNP values]
I then use awk to pull out the conditionally significant coding SNPs.
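For reference, a minimal sketch of what that awk step might look like. It assumes the selection output is the .jma.cojo file written by --cojo-slct and that only the SNP IDs are needed for --cojo-cond; the function name and file names are hypothetical placeholders, not my actual paths.

```shell
# Hypothetical helper: print the SNP IDs from a COJO selection output file.
# The SNP column is located by its header name rather than a fixed position.
extract_cojo_snps() {
    # $1 = selection output from --cojo-slct (assumed to be a .jma.cojo file)
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "SNP") snp = i
            if (!snp) { print "no SNP column in header" > "/dev/stderr"; exit 1 }
            next
        }
        { print $snp }   # one SNP ID per line
    ' "$1"
}

# Example usage (file names are placeholders):
# extract_cojo_snps coding_chr8.jma.cojo > coding_chr8.cond.snplist
```

Looking the column up by name keeps the step working if the column order ever differs between outputs.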
gcta64 --bfile [same as above] \
    --cojo-file [same as above] \
    --cojo-cond [the conditionally significant coding SNPs from awk] \
    --out [my final summary statistics]
Taking chr8 as an example, many of my results look like this:
Chr  SNP           bp      refA  freq        b             se           p            n       freq_geno   bC            bC_se        pC
8    8:156716:G:C  156716  C     0.00241033  0.00206278    0.00061818   0.000847384  300437  0.00396991  0.00206278    0.00061819   0.000847478
8    8:156747:C:G  156747  G     0.00221684  0.00223315    0.000640693  0.000491219  304047  0.00396991  0.00223315    0.000640705  0.000491294
8    8:157714:C:G  157714  G     0.00315905  -0.000603486  0.000521356  0.247056     322534  0.00480568  -0.000603486  0.000521356  0.247056
But many more look like this:
Chr  SNP           bp      refA  freq       b             se           p         n       freq_geno  bC  bC_se  pC
8    8:156244:T:C  156244  C     0.036727   1.05558e-05   0.000164043  0.948693  289987  0.0385499  NA  NA     NA
8    8:156288:G:C  156288  C     0.253299   -8.04971e-07  6.76117e-05  0.990501  319304  0.25491    NA  NA     NA
8    8:156294:C:A  156294  A     0.0519816  -9.96523e-05  0.000136206  0.464395  301973  0.052967   NA  NA     NA
Overall, for this chromosome, only ~140,000 of the ~655,000 SNPs I input got actual conditioned results, as opposed to NAs.
I have run this analysis on different GWAS results (in the same format) and not had this problem. Does anyone know what might be going on?
Thank you very much!
Edit 1: I know that, as discussed here, if a SNP is completely predictable from a linear combination of SNPs fixed in the model, its P-value will be NA. However, I do not believe this is the explanation, for two reasons.
1. Because I am working with summary statistics for the GWAS, the collinearity is determined from the file of genotypes I use to calculate LD. I have used the same genotypes for conditioning variants in other GWAS without getting nearly as many NAs in my results.
2. Granted, the coding variants I condition on differ between GWAS. However, in my chr8 example I am conditioning on only one coding SNP, so this explanation would require roughly 500,000 variants spread across the chromosome to be in perfect LD with that one SNP.
I have noticed something strange, but am not sure what to make of it. Of variants on chr8 with a MAF below 0.0104, 75% have a corrected P-value. Of variants with a MAF above this cutoff, every single one returns an NA for the P-value.
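The MAF split above can be tabulated with something like the following. This is only a sketch: it assumes the conditional output has a header containing freq and pC columns (as in the excerpts above), takes MAF as min(freq, 1 - freq), and uses a hypothetical function name and placeholder file name.

```shell
# Hypothetical helper: count NA conditional P-values on either side of a MAF cutoff.
# Output columns (tab-separated): bin, total SNPs, SNPs with pC == NA.
na_by_maf() {
    # $1 = conditional-analysis output with a header row, $2 = MAF cutoff
    awk -v cut="$2" '
        NR == 1 {
            # Map column names to positions so the column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            f = $(col["freq"]) + 0
            maf = (f > 0.5) ? 1 - f : f        # fold onto the minor allele
            bin = (maf < cut + 0) ? "below" : "above"
            total[bin]++
            if ($(col["pC"]) == "NA") na[bin]++
        }
        END {
            printf "below\t%d\t%d\n", total["below"] + 0, na["below"] + 0
            printf "above\t%d\t%d\n", total["above"] + 0, na["above"] + 0
        }
    ' "$1"
}

# Example usage (file name is a placeholder):
# na_by_maf final_chr8.cma.cojo 0.0104
```

Running the same tabulation on the freq_geno column instead of freq would show whether the split tracks the GWAS frequencies or the LD reference frequencies.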