I have summary statistics of a GWAS in a tab-separated format, as follows:
SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ F_MISS BETA SE P
I want to adjust these summary statistics for LD (preferably by pruning), keeping only one SNP from each set of SNPs in LD. I don't want to threshold on p-value (in fact, the pruned SNPs should stay representative in terms of p-values, as I'm trying to do a sort of enrichment analysis and need non-significant and significant results alike). I also don't have access to the genotype data, only to these summary statistics. They also contain some X-chromosomal SNPs.
I'm not sure which tool is suitable for this. I've considered the following tools:
Plink
As far as I know, one can perform LD pruning in Plink. However, I can't seem to find a way to perform this pruning in Plink with this file format.
GCTA-Cojo
From reading, this seems to do exactly what I want it to. However, the documentation states that "it will be extremely time-consuming if you set a very low significance level, e.g. 5e-3". I'm guessing this might not run to completion if I were to try it with no p-value threshold at all.
LDpred
The documentation states "LDpred is a Python based software package that adjusts GWAS summary statistics for the effects of linkage disequilibrium". This does sound like what I need, though I'm not sure if it really can perform this step of LD pruning in isolation. I wanted to try it, though I've not gotten it to work on my system.
Any help is greatly appreciated. Is one of these tools suitable or is there another that can be used for this task? Is it possible to wrangle these summary statistics into a format suitable for Plink? Is GCTA-Cojo feasible without a p-value threshold? Is LDpred capable of this and would be worth spending time to set up?
Thank you for your answer. I actually don't want to keep the most significant SNP. Ideally, I want to adjust only for LD, without selecting or filtering based on p-value. Essentially, I need the p-value distribution to stay representative.
Or is this not possible?
Pruning is possible, but it isn't exactly random either. You can do it by first making a list of the SNPs in your summary statistics file, then running plink's LD pruning restricted to that list.
This will ask plink to perform pruning with a window size of 200kb, sliding across the genome with a step size of 50 variants at a time, and to filter out SNPs until no pair within a window has an LD r2 higher than 0.25.
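For anyone who wants to script these steps, here is a minimal sketch in Python. The exact command is not shown in this thread, so the plink call below is an assumption: it uses PLINK 1.9's --extract and --indep-pairwise flags with the window, step and r2 values described above, run against a reference panel in bed/bim/fam format. The function names and file paths are hypothetical.

```python
import csv
from pathlib import Path


def write_snp_list(sumstats_path, snp_list_path, snp_column="SNP"):
    """Write one SNP ID per line, taken from the SNP column of the summary statistics."""
    count = 0
    with open(sumstats_path, newline="") as src, open(snp_list_path, "w") as dst:
        for row in csv.DictReader(src, delimiter="\t"):
            dst.write(row[snp_column] + "\n")
            count += 1
    return count


def build_prune_command(reference_prefix, snp_list_path, out_prefix,
                        window="200kb", step="50", r2="0.25"):
    """Build the plink argument list (assumed PLINK 1.9 syntax); nothing is executed here."""
    return [
        "plink",
        "--bfile", str(reference_prefix),   # reference panel, bed/bim/fam prefix
        "--extract", str(snp_list_path),    # restrict to SNPs in the summary statistics
        "--indep-pairwise", window, step, r2,
        "--out", str(out_prefix),
    ]


def filter_to_pruned(sumstats_path, prune_in_path, out_path, snp_column="SNP"):
    """Keep only the rows whose SNP ID appears in plink's <out_prefix>.prune.in file."""
    keep = set(Path(prune_in_path).read_text().split())
    kept = 0
    with open(sumstats_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        for row in reader:
            if row[snp_column] in keep:
                writer.writerow(row)
                kept += 1
    return kept
```

The command list can be passed to subprocess.run(cmd, check=True). Plink then writes the retained SNP IDs to a file ending in .prune.in, which filter_to_pruned uses to subset the summary statistics without looking at the p-values.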
Hi Sam,
If my data are on build 37, what kind of reference panel would you recommend for this purpose? Could you please share a link?
You can use the 1000 Genomes data provided by plink as a reference (you might need to convert the provided files to bed/bim/fam format using --make-bed in plink 2 in order to run clumping in plink 1.9).
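Before relying on a reference panel, it may be worth checking how many of your summary-statistic SNPs it actually contains, since SNPs absent from the reference cannot be pruned or clumped against it. A rough sketch in Python, assuming the reference has already been converted to bed/bim/fam and that both files use the same SNP naming scheme and genome build; the function names are hypothetical.

```python
import csv


def read_bim_ids(bim_path):
    """Return the set of variant IDs (second column) from a PLINK .bim file."""
    ids = set()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                ids.add(fields[1])
    return ids


def split_by_reference(sumstats_path, bim_path, snp_column="SNP"):
    """Split summary-statistic SNP IDs into those found in and missing from the reference."""
    reference_ids = read_bim_ids(bim_path)
    found, missing = [], []
    with open(sumstats_path, newline="") as src:
        for row in csv.DictReader(src, delimiter="\t"):
            snp = row[snp_column]
            (found if snp in reference_ids else missing).append(snp)
    return found, missing
```

If a large share of SNPs ends up in the missing list, a mismatch in IDs or build is a likely cause and worth resolving before pruning.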