I am trying to carry out a SNP genomic enrichment analysis and I was hoping you could help.
Basically, I have the following two sets of SNPs:
-set_A: 1,695 foreground SNPs. These are 1000 Genomes variants that are also QTLs for a trait I'm interested in. All of them lie within ChIP-seq peak intervals for a TF.
-set_B: 116,000 background SNPs. This is a superset of set_A, and all of its SNPs lie within ChIP-seq peaks for the same TF. It contains every SNP I tested for the QTL property above.
I want to determine whether set_A is enriched in a particular annotation compared to set_B. In other words, relative to all SNPs in my ChIP-seq peaks that were tested for QTL status, is set_A over-represented in that annotation? For example, the annotation might be strong-LD intervals around genome-wide significant SNPs from the GWAS Catalog. Therefore I want to ask:
"Are my set_A variants more likely to be in GWAS LD blocks for some disease/trait compared to the background set of SNPs?"
I have already verified that set_A is MAF-matched to set_B (bootstrapped KS test on the two MAF distributions), so this should not be a problem. I ran the GAT simulation-based enrichment tool:
which works fine and has returned enrichment results. However, I believe my foreground and background sets need more pre-processing: there is LD structure within both set_A and set_B, i.e. some SNPs in A are in LD with each other, and the same holds for B. I believe I need to correct for this too, because SNPs in LD are not independent observations and can inflate the enrichment. I would probably need to LD-match set_A and set_B, or maybe pick or subsample only independent SNPs from set_A and set_B. GAT, which is designed to compute simple interval enrichments, cannot do this.
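To make the independence concern concrete, here is a minimal sketch of the simplest foreground-versus-background test (a one-sided hypergeometric test). This is not what GAT does internally; it is only an illustration, and the function names are made up for this post. It treats every SNP as an independent draw from set_B, which is exactly the assumption that LD within the sets violates.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric.

    N: background SNPs (set_B), K: background SNPs inside the annotation,
    n: foreground SNPs (set_A), k: foreground SNPs inside the annotation.
    Assumes SNPs are independent draws, which LD breaks.
    """
    lo = max(k, n - (N - K), 0)   # smallest count that is possible and >= k
    hi = min(n, K)                # largest possible count
    if lo > hi:
        return 0.0
    denom = log_comb(N, n)
    logs = [log_comb(K, i) + log_comb(N - K, n - i) - denom
            for i in range(lo, hi + 1)]
    m = max(logs)                 # log-sum-exp for numerical stability
    return min(1.0, math.exp(m) * sum(math.exp(x - m) for x in logs))


def fold_enrichment(N, K, n, k):
    """Observed annotated fraction in the foreground over that in the background."""
    return (k / n) / (K / N)


if __name__ == "__main__":
    # N and n are the set sizes from this post; K and k are hypothetical
    # placeholder counts, not real results.
    N, n = 116_000, 1_695
    K, k = 11_600, 339
    print(fold_enrichment(N, K, n, k), hypergeom_upper_tail(N, K, n, k))
```

If several set_A SNPs sit in the same LD block and that block overlaps the annotation, k is counted several times for what is effectively one signal, so the p-value from a test like this would be too small.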
There is a tool from the Broad Institute that might be able to help me, called SNPsnap:
Interestingly, SNPsnap should be able to carry out LD clumping of the foreground SNPs, so it could correct for LD-derived inflation of the enrichment. However, SNPsnap only returns a frequency-matched background of (at most) 20,000 SNPs. I don't need this, because I believe I already have the most suitable background set (set_B), and in any case I need my background SNPs to be in the ChIP-seq peaks.
Additionally, SNPsnap seems quite experimental (about 80% of my runs have failed) and emails to the authors go unanswered, so I believe the program is not really supported.
Therefore I was hoping someone here might have ideas on how to do this:
LD clumping: what if I mapped my set_A SNPs to strong-LD intervals and then computed the enrichment of set_A LD blocks in GWAS LD blocks, instead of the enrichment of set_A SNPs in GWAS LD blocks?
Alternatively, for each LD block containing more than one set_A SNP, I could select the "best" SNP according to some metric. Any other ideas or suitable tools?
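For what it's worth, the second idea could be sketched roughly as below. This is only an illustration under several assumptions: the LD blocks are already available as non-overlapping half-open intervals (chrom, start, end), each SNP is a tuple (id, chrom, pos, score), and "best" means the lowest score (for example a QTL p-value). The function names and the tuple layout are hypothetical, not taken from any existing tool. SNPs outside every block are kept as their own singleton block.

```python
from bisect import bisect_right
from collections import defaultdict


def assign_blocks(snps, blocks):
    """Map each SNP id to the LD block containing it.

    snps:   list of (snp_id, chrom, pos, score)
    blocks: list of (chrom, start, end), half-open and non-overlapping
    SNPs outside all blocks get a singleton block (chrom, pos, pos + 1).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in blocks:
        by_chrom[chrom].append((start, end))
    starts = {}
    for chrom in by_chrom:
        by_chrom[chrom].sort()
        starts[chrom] = [s for s, _ in by_chrom[chrom]]

    block_of = {}
    for snp_id, chrom, pos, _score in snps:
        # Index of the last block starting at or before pos, or -1 if none.
        idx = bisect_right(starts.get(chrom, []), pos) - 1
        if idx >= 0 and pos < by_chrom[chrom][idx][1]:
            start, end = by_chrom[chrom][idx]
            block_of[snp_id] = (chrom, start, end)
        else:
            block_of[snp_id] = (chrom, pos, pos + 1)
    return block_of


def best_per_block(snps, blocks):
    """Keep one SNP per LD block: the one with the lowest score."""
    block_of = assign_blocks(snps, blocks)
    best = {}
    for snp in snps:
        key = block_of[snp[0]]
        if key not in best or snp[3] < best[key][3]:
            best[key] = snp
    return sorted(best.values(), key=lambda s: (s[1], s[2]))


if __name__ == "__main__":
    # Toy coordinates and scores, purely illustrative.
    toy_blocks = [("chr1", 100, 200), ("chr1", 300, 400)]
    toy_snps = [("rs1", "chr1", 110, 0.01), ("rs2", "chr1", 150, 0.001),
                ("rs3", "chr1", 350, 0.05), ("rs4", "chr1", 250, 0.2)]
    print([s[0] for s in best_per_block(toy_snps, toy_blocks)])
```

The keys of the dictionary returned by `assign_blocks` also give the block-level view from the first idea: counting distinct blocks rather than SNPs. Either way, I assume the same collapsing would have to be applied to set_B as well, so that foreground and background are counted in the same units.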
Thanks for any suggestions you might be willing to share.