Question: Pruning GWAS summary statistics for LD
7 months ago by
ika40 wrote:

I have summary statistics of a GWAS in a tab-separated format, as follows:


I want to adjust these summary statistics for LD (preferably by pruning), keeping only one SNP from each group of SNPs in LD. I don't want to threshold on p-value (in fact, the pruned SNPs should stay representative in terms of p-values, as I'm trying to do a sort of enrichment analysis and need non-significant and significant results alike). I also don't have access to the genotype data, only to these summary statistics. They also contain some X-chromosomal SNPs.

I'm not sure which tool is suitable for this. I've considered the following tools:

  • Plink

    As far as I know, one can perform LD pruning in Plink. However, I can't seem to find a way to perform this pruning in Plink with this file format.

  • GCTA-Cojo

    From what I've read, this seems to do exactly what I want. However, the documentation states that "it will be extremely time-consuming if you set a very low significance level, e.g. 5e-3". I'm guessing this might not run to completion if I were to try it with no p-value threshold at all.

  • LDpred

    The documentation states "LDpred is a Python based software package that adjusts GWAS summary statistics for the effects of linkage disequilibrium". This does sound like what I need, though I'm not sure if it really can perform this step of LD pruning in isolation. I wanted to try it, though I've not gotten it to work on my system.

Any help is greatly appreciated. Is one of these tools suitable, or is there another that can be used for this task? Is it possible to wrangle these summary statistics into a format suitable for Plink? Is GCTA-Cojo feasible without a p-value threshold? Is LDpred capable of this, and would it be worth spending time to set up?

7 months ago by
New York
Sam3.3k wrote:

It seems like you want to do clumping (pruning the summary statistics, but keeping the most significant SNP).

You will always need a reference panel.

plink  --clump <sumstat> --clump-p1 <max p-value to retain> --clump-p2 1 --clump-r2 <r2 threshold> --clump-kb <window size> --bfile <reference> --out <output prefix>
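By default, plink 1.9 looks for header fields named SNP and P in the file given to --clump (--clump-snp-field and --clump-field change those names). If the summary statistics use other headers, a small awk helper can write a two-column input file. This is only a sketch: the function name sumstats_to_clump_input is made up for illustration, and it assumes a tab-separated file with a single header row.

```shell
# Hypothetical helper (not part of plink): write a two-column SNP/P file
# from a tab-separated summary statistics file with a header row.
# usage: sumstats_to_clump_input <sumstats.tsv> <id-column> <p-column>
sumstats_to_clump_input() {
    awk -F'\t' -v OFS='\t' -v idcol="$2" -v pcol="$3" '
        NR == 1 {
            # Locate the requested columns by their header names.
            for (i = 1; i <= NF; i++) {
                if ($i == idcol) a = i
                if ($i == pcol)  b = i
            }
            if (!a || !b) { print "missing column" > "/dev/stderr"; exit 1 }
            print "SNP", "P"
            next
        }
        { print $a, $b }' "$1"
}

# Example (file and column names are placeholders):
#   sumstats_to_clump_input sumstats.tsv MarkerName Pval > sumstats.clump.txt
```

The resulting file can then be passed as <sumstat> in the command above.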
written 7 months ago by Sam3.3k

Thank you for your answer. I actually don't want to keep the most significant SNP. Ideally, I want to adjust only for LD, without selecting or filtering based on p-value. Essentially, I need the p-value distribution to stay representative.

Or is this not possible?

written 7 months ago by ika40

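Pruning is possible, but note that it isn't exactly random either. First make a list of the SNPs in your summary statistics file, then restrict the reference panel to that list and prune:

plink --bfile <ld-reference> --extract <snp-list> --indep-pairwise 200 50 0.25 --out <prefix>

This asks plink to prune with a window of 200 variants (the window is counted in variants unless you add a kb suffix, e.g. 200kb), shifting the window 50 variants at a time, and removing SNPs until no pair within a window has an LD r2 above 0.25. The retained SNPs are written to <prefix>.prune.in and the removed ones to <prefix>.prune.out.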
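The two file-wrangling steps around the pruning command (listing the SNPs in the summary statistics, then keeping only the pruned-in rows) could look roughly like this. It is a sketch under assumptions: the helper names are invented, the file is taken to be tab-separated with one header row, and the ID column is assumed to be called SNP unless another name is passed.

```shell
# Hypothetical helpers (not part of plink).

# Print the variant IDs, one per line and without the header.
# usage: sumstats_snp_list <sumstats.tsv> [id-column-name]
sumstats_snp_list() {
    awk -F'\t' -v col="${2:-SNP}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { print $c }' "$1"
}

# Print the header plus only the rows whose ID appears in a keep list,
# e.g. the <prefix>.prune.in file written by --indep-pairwise.
# usage: sumstats_keep <keep-list> <sumstats.tsv> [id-column-name]
sumstats_keep() {
    awk -F'\t' -v col="${3:-SNP}" '
        FILENAME == ARGV[1] { keep[$1] = 1; next }   # first file: IDs to keep
        FNR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($c in keep)' "$1" "$2"
}

# Example workflow (file names are placeholders):
#   sumstats_snp_list sumstats.tsv > sumstats.snps
#   plink --bfile <ld-reference> --extract sumstats.snps \
#         --indep-pairwise 200 50 0.25 --out <prefix>
#   sumstats_keep <prefix>.prune.in sumstats.tsv > sumstats.pruned.tsv
```

Because the kept rows are chosen from LD in the reference panel alone, the p-values in the summary statistics play no part in which SNPs survive.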

written 7 months ago by Sam3.3k

Hi Sam,

If my data is on build 37, what kind of reference panel would you recommend for this purpose? Could you please share a link?

written 4 months ago by anamaria140

You can use the 1000 Genomes data provided by plink as a reference (you might need to convert the provided files to bed/bim/fam format using --make-bed in plink 2 in order to run clumping in plink 1.9).

written 4 months ago by Sam3.3k