Pruning GWAS summary statistics for LD
1
2
2.8 years ago
ika ▴ 50

I have summary statistics of a GWAS in a tab-separated format, as follows:

SNP	CHR	BP	GENPOS	ALLELE1	ALLELE0	A1FREQ	F_MISS	BETA	SE	P


I want to adjust these summary statistics for LD (preferably by pruning), keeping only one SNP from each group of SNPs in LD. I don't want to threshold on p-value (in fact, the pruned SNPs should stay representative in terms of p-values, as I'm trying to do a sort of enrichment analysis and need non-significant and significant results alike). I also don't have access to the genotype data, only to these summary statistics. They also contain some X-chromosomal SNPs.

I'm not sure which tool is suitable for this. I've considered the following tools:

• Plink

As far as I know, one can perform LD pruning in Plink. However, I can't seem to find a way to perform this pruning in Plink with this file format.

• GCTA-Cojo

From reading the documentation, this seems to do exactly what I want. However, it states that "it will be extremely time-consuming if you set a very low significance level, e.g. 5e-3". I'm guessing this might not run to completion if I were to try it with no p-value threshold at all.

• LDpred

The documentation states "LDpred is a Python based software package that adjusts GWAS summary statistics for the effects of linkage disequilibrium". This does sound like what I need, though I'm not sure whether it can really perform this LD pruning step in isolation. I wanted to try it, but I haven't gotten it to work on my system.

Any help is greatly appreciated. Is one of these tools suitable, or is there another that can be used for this task? Is it possible to wrangle these summary statistics into a format suitable for Plink? Is GCTA-Cojo feasible without a p-value threshold? Is LDpred capable of this, and would it be worth spending the time to set up?

ldpred cojo plink LD pruning • 4.0k views
5
2.8 years ago
Sam ★ 4.6k

It seems like you want to do clumping (pruning the summary statistics, but keeping the most significant SNP).

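You will always need a reference panel, because the LD between SNPs has to be estimated from genotypes and the summary statistics don't contain that information.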

plink --clump <sumstat> --clump-p1 <max p-value to retain> --clump-p2 1 --clump-r2 <r2 threshold> --clump-kb <window size> --bfile <reference> --out <output prefix>

0

Thank you for your answer. I actually don't want to keep the most significant SNP. Ideally, I want to adjust only for LD, without selecting or filtering based on p-value. Essentially, I need the p-value distribution to stay representative.

Or is this not possible?

4

It's possible to do pruning, but that isn't exactly random either. You can do that by first making a list of the SNPs in your summary statistics file, then passing that list to plink with --extract so that only those SNPs are considered:

plink --bfile <ld-reference> --extract <snp-list> --indep-pairwise 200 50 0.25 --out <prefix>


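This asks plink to perform pruning with a window of 200 variants (plink reads the window size as a variant count unless you add a kb suffix, e.g. 200kb), sliding across the genome 50 variants at a time, and to remove one SNP from each pair whose LD r2 is higher than 0.25. The SNPs that survive pruning are written to <prefix>.prune.in, and the removed ones to <prefix>.prune.out.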
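The two bookkeeping steps around the plink call (writing the SNP list, then subsetting the summary statistics to the pruned SNPs) could be sketched in Python roughly as follows. This is only a minimal sketch: the function names and file names are made up for illustration, and it assumes the summary statistics are tab-separated with a header containing an SNP column whose IDs match those in the reference panel.

```python
import csv


def write_snp_list(sumstats_path, out_path, snp_col="SNP"):
    """Write one SNP ID per line (hypothetical helper; the output can be given to plink --extract)."""
    n = 0
    with open(sumstats_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.DictReader(src, delimiter="\t"):
            dst.write(row[snp_col] + "\n")
            n += 1
    return n  # number of SNP IDs written


def filter_sumstats(sumstats_path, prune_in_path, out_path, snp_col="SNP"):
    """Keep only the rows whose SNP ID appears in plink's <prefix>.prune.in (hypothetical helper)."""
    # .prune.in is a plain list of variant IDs, one per line
    with open(prune_in_path) as fh:
        keep = {line.strip() for line in fh if line.strip()}

    kept = 0
    with open(sumstats_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            if row[snp_col] in keep:
                writer.writerow(row)  # all columns (BETA, SE, P, ...) are carried over unchanged
                kept += 1
    return kept  # number of rows retained
```

You would call write_snp_list("sumstats.tsv", "snps.txt") before running plink, and filter_sumstats("sumstats.tsv", "<prefix>.prune.in", "sumstats.pruned.tsv") afterwards (file names are placeholders). SNPs that are missing from the reference panel never make it into .prune.in, so they are dropped as well. It is worth comparing the row counts before and after.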

0

Hi Sam,

If my data is on build 37, what kind of reference panel would you recommend for this purpose? Could you please share a link?

1

You can use the 1000 Genomes data provided by plink as a reference (you might need to convert the provided files to bed/bim/fam format using --make-bed in plink 2 in order to run clumping in plink 1.9).