Question: Plink: Understanding LD Clumping vs Pruning
8 months ago by
JourneyToAbyss wrote:

Background: I am a grad student doing eQTL analysis and just starting to dip my toes into PLINK. From what I understand, LD pruning is typically done with the '--indep-pairwise' option. Additionally, I can use the '--show-tags all' option to keep track of the pruned SNPs.

Thing is, I came across a tutorial highlighting why clumping is preferred over pruning. I believe I understand the limitation of pruning (e.g. a situation may arise where many SNPs are pruned, creating larger-than-intended regions with no SNP representation). That being said, I am quite confused about how clumping works and am simply looking for more material to read. For instance, I don't understand what the association test is calculating and how that is used in the clumping procedure (according to the tutorial, a MAF statistic is used, but that statistic isn't present in the association file I created). I'm also having difficulty understanding how the index variant and clump variants are used.

Perhaps I am going down a rabbit hole that I shouldn't be concerned with given my eventual goal (eQTL), but I was hoping someone could recommend some resources comparing the two approaches.

plink ld pruning eqtl clumping • 2.5k views
modified 8 months ago by Kevin Blighe • written 8 months ago by JourneyToAbyss

I can explain the algorithms for you:

  1. pruning: it starts with the first SNP (in genome order) and computes its correlation with the following ones (e.g. the next 50). When it finds a large correlation, it removes one SNP from the correlated pair, keeping the one with the larger minor allele frequency (MAF), and thus possibly removing the first SNP. Then it moves on to the next SNP that has not yet been removed. So, in a worst-case scenario, this algorithm may in fact remove all SNPs of the genome (except one).

  2. clumping: it uses some statistic (usually the p-value in the case of GWAS/PRS) to sort the SNPs by importance (e.g. most significant first). It takes the first one (e.g. the most significant SNP) and removes SNPs that are too correlated with it in a window around it. As opposed to pruning, this procedure makes sure that this SNP is never removed, keeping at least one representative SNP per region of the genome. Then it moves on to the next most significant SNP that has not been removed yet. In the case of computing principal components, there is no p-value available, so I propose using the MAF instead as the statistic to rank SNPs (in decreasing order). Using MAFs makes clumping very similar to pruning, but without the worst-case scenario.
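
The two procedures above can be sketched in a few lines of Python. This is only a toy illustration of the descriptions given here, not PLINK's actual implementation: the function names `prune` and `clump`, the precomputed `r2` matrix, and the window measured in number of SNPs (rather than kb) are all assumptions made for the example. The toy data is a chain of five SNPs where each neighbouring pair is correlated and the MAF increases along the genome, which is the kind of worst case mentioned for pruning.

```python
def prune(r2, maf, window=50, threshold=0.2):
    """Greedy pruning in genome order (toy sketch, not PLINK's code)."""
    m = len(maf)
    keep = [True] * m
    for i in range(m):
        if not keep[i]:
            continue
        # Compare SNP i with the following SNPs inside the window.
        for j in range(i + 1, min(i + 1 + window, m)):
            if keep[j] and r2[i][j] > threshold:
                if maf[i] >= maf[j]:
                    keep[j] = False
                else:
                    keep[i] = False  # the current SNP itself is dropped
                    break
    return [i for i in range(m) if keep[i]]


def clump(r2, stat, window=50, threshold=0.2):
    """Greedy clumping; a larger `stat` means a more important SNP.

    For GWAS p-values, pass e.g. -log10(p); for PCA, pass the MAF.
    """
    m = len(stat)
    kept, removed = [], [False] * m
    for i in sorted(range(m), key=lambda k: -stat[k]):
        if removed[i]:
            continue
        kept.append(i)  # index SNP: never removed afterwards
        for j in range(max(0, i - window), min(m, i + window + 1)):
            if j != i and r2[i][j] > threshold:
                removed[j] = True
    return sorted(kept)


# Toy chain: neighbours have r2 = 0.5, all other pairs r2 = 0.05.
m = 5
r2 = [[1.0 if i == j else 0.5 if abs(i - j) == 1 else 0.05
       for j in range(m)] for i in range(m)]
maf = [0.1, 0.2, 0.3, 0.4, 0.5]

pruned = prune(r2, maf)
clumped = clump(r2, maf)
print(pruned)   # [4]
print(clumped)  # [0, 2, 4]
```

On this toy chain, pruning discards each SNP in turn in favour of its higher-MAF neighbour and ends up with a single SNP, while clumping on MAF keeps one representative for every other position.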

modified 4 months ago • written 4 months ago by Florian Privé (privefl)

If I remember correctly, that blog was written with polygenic score analysis in mind, where clumping is preferred. The reason is that we want to retain the SNPs that have the strongest signal (lowest p-value). With pruning, SNPs are removed without regard to their signal, whereas with clumping we preferentially retain the SNPs with the stronger signal, which allows us to construct a more predictive polygenic risk score.

written 8 months ago by Sam

Of course clumping should be preferred in Polygenic Score analysis.

In the document, the author (me) refers to the case of computing Principal Components, where pruning is typically used.

If this document is not clear enough, please mention which parts and I'll try to improve it.

written 4 months ago by Florian Privé (privefl)
8 months ago by
Kevin Blighe wrote:

The link that you posted does not seem like an in-depth analysis. If the author feels that clumping is preferable to pruning in every situation, then please encourage her/him to publish the concept as part of a more comprehensive analysis. I only see one pruning example in the blog when, in fact, pruning can be performed in many different ways, both by pre-filtering the input data and by modifying the parameters passed to --indep-pairwise.

Keep in mind that anybody can post anything to the World Wide Web - all opinions are wholly representative.

In certain situations, clumping may indeed be appropriate, but not in all. To understand clumping better, please read the docs on the PLINK website itself.


modified 12 weeks ago • written 8 months ago by Kevin Blighe

This has since been published. It is mainly a worst-case scenario, warning people that pruning could potentially go wrong. As stated in the paper, using pruning or clumping (on MAF) gives very similar results in the general case.

written 4 months ago by Florian Privé (privefl)

Thank you for the update, Florian.

written 4 months ago by Kevin Blighe