Plink: Understanding LD Clumping vs Pruning
5.6 years ago

Background: I am a grad student doing eQTL analysis and am just starting to dip my toes into PLINK. From what I understand, LD pruning is typically done with the '--indep-pairwise' option. Additionally, I can use the '--show-tags all' option to keep track of the pruned SNPs.

The thing is, I came across a tutorial explaining why clumping is preferred over pruning (https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html). I believe I understand the limitation of pruning (e.g. a situation may arise where many SNPs are pruned, creating larger-than-intended regions with no SNP representation). That being said, I am quite confused about how clumping works and am simply looking for more material to read. For instance, I don't understand what the association test is calculating and how that is used in the clumping procedure (according to the tutorial, a MAF statistic is used, but that statistic isn't present in the association file I created). I'm also having difficulty understanding how the index variant and the clumped variants are used.

Perhaps I am going down a rabbit hole that I shouldn't be concerned with given my eventual goal (eQTL), but I was hoping someone could recommend some resources comparing the two approaches.

Plink eQTL LD Clumping Pruning • 31k views

I can explain the algorithms for you:

1. pruning: it starts with the first SNP (in genome order) and computes its correlation with the following ones (e.g. the next 50). When it finds a large correlation, it removes one SNP from the correlated pair, keeping the one with the larger minor allele frequency (MAF), thus possibly removing the first SNP. Then it goes on with the next SNP that has not yet been removed. So, in a worst-case scenario, this algorithm may in fact remove all SNPs of the genome (except one).

2. clumping: it uses some statistic (usually the p-value in the case of GWAS/PRS) to sort the SNPs by importance (e.g. putting the most significant ones first). It takes the first one (e.g. the most significant SNP) and removes SNPs that are too correlated with it in a window around it. As opposed to pruning, this procedure makes sure that this SNP is never removed, keeping at least one representative SNP per region of the genome. Then it goes on with the next most significant SNP that has not been removed yet. In the case of computing principal components, there is no p-value available, so I propose to use the MAF instead as the statistic to rank SNPs (in decreasing order). Using MAFs makes clumping very similar to pruning, but without any worst-case scenario.
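To make the difference concrete, here is a minimal Python sketch of the two procedures as described above. It is not PLINK's or bigsnpr's actual implementation: the function names `prune` and `clump`, the toy r² matrix, the window measured in number of SNPs, and the 0.2 threshold are all assumptions chosen for illustration. The toy data is a chain of six SNPs where each SNP is correlated only with its direct neighbours and MAF increases along the chain, which is the kind of worst case mentioned in point 1.

```python
def prune(r2, maf, window=50, threshold=0.2):
    """Toy sequential pruning: walk SNPs in genome order and, for each
    correlated pair, drop the SNP with the smaller MAF."""
    n = len(maf)
    keep = [True] * n
    for i in range(n):
        if not keep[i]:
            continue
        for j in range(i + 1, min(i + 1 + window, n)):
            if keep[j] and r2[i][j] > threshold:
                if maf[j] > maf[i]:
                    keep[i] = False  # the current SNP itself is removed
                    break
                keep[j] = False
    return [i for i in range(n) if keep[i]]


def clump(r2, stat, window=50, threshold=0.2):
    """Toy clumping: visit SNPs by decreasing `stat`; every visited SNP
    is kept, and its correlated neighbours in the window are removed."""
    n = len(stat)
    removed = [False] * n
    kept = []
    for i in sorted(range(n), key=lambda k: -stat[k]):
        if removed[i]:
            continue
        kept.append(i)
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i and r2[i][j] > threshold:
                removed[j] = True
    return sorted(kept)


# Hypothetical chain of 6 SNPs: r2 = 0.25 between direct neighbours, 0 otherwise.
n = 6
r2 = [[1.0 if i == j else 0.25 if abs(i - j) == 1 else 0.0 for j in range(n)]
      for i in range(n)]
maf = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]

pruned = prune(r2, maf)
clumped = clump(r2, maf)  # MAF used as the ranking statistic
print(pruned)   # [5]
print(clumped)  # [1, 3, 5]
```

On this chain, pruning removes each SNP in turn in favour of its higher-MAF neighbour and ends up with a single SNP, while clumping on MAF keeps every other SNP, so each part of the region stays represented.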


If I remember correctly, that blog was written with polygenic score analysis in mind, where clumping is preferred. The reason clumping is preferred in polygenic score analysis is that we want to retain the SNPs that have the strongest signal (lowest p-value). With pruning, SNPs are removed irrespective of their association signal, whereas with clumping we preferentially retain the SNPs with the stronger signal, which allows us to construct a more predictive polygenic risk score.
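For the p-value case, the following Python sketch shows how index variants and clumped variants relate. It is only an illustration under assumed inputs: the function name `clump_by_pvalue`, the toy r² values, the p-values, and the window in number of SNPs are all hypothetical. PLINK's own clumping additionally applies p-value thresholds and a physical-distance window, which are omitted here.

```python
def clump_by_pvalue(r2, pvalues, window=250, threshold=0.5):
    """Toy p-value clumping. Returns {index SNP: [clumped SNPs]}.

    SNPs are visited from the smallest p-value upwards. Each SNP not yet
    assigned to a clump becomes an index SNP, and unassigned SNPs in its
    window with r2 above the threshold are clumped under it."""
    n = len(pvalues)
    assigned = [False] * n
    clumps = {}
    for i in sorted(range(n), key=lambda k: pvalues[k]):
        if assigned[i]:
            continue
        assigned[i] = True
        members = []
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if not assigned[j] and r2[i][j] > threshold:
                assigned[j] = True
                members.append(j)
        clumps[i] = members
    return clumps


# Hypothetical data: SNPs 0-2 form one LD block, SNPs 3-4 another.
n = 5
r2 = [[0.0] * n for _ in range(n)]
for block in ([0, 1, 2], [3, 4]):
    for a in block:
        for b in block:
            r2[a][b] = 1.0 if a == b else 0.8

pvalues = [1e-4, 1e-8, 1e-3, 0.04, 0.2]
clumps = clump_by_pvalue(r2, pvalues)
print(clumps)  # {1: [0, 2], 3: [4]}
```

Here SNP 1 has the lowest p-value, so it becomes the index variant of the first block and SNPs 0 and 2 are clumped under it. Only the index variants (1 and 3) would go into the score.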


Of course clumping should be preferred in Polygenic Score analysis.

In the document, the author (me) refers to the case of computing Principal Components, where pruning is typically used.

If this document is not clear enough, please mention which parts and I'll try to improve it.

5.6 years ago

The link that you posted does not seem like an in-depth analysis. If the author feels that clumping is preferable to pruning in every situation, then please encourage her/him to publish the concept as part of a more comprehensive analysis. I only see one pruning example in the blog when, in fact, pruning can be performed in a variety of ways by pre-filtering the input data and by modifying the parameters passed to --indep-pairwise.

Keep in mind that anybody can post anything to the World Wide Web - all opinions are wholly representative.

Clumping may indeed be appropriate in certain situations, but not in all of them. To understand clumping better, please read the docs at the PLINK website itself: http://zzz.bwh.harvard.edu/plink/clump.shtml

Kevin


This has been published in https://doi.org/10.1093/bioinformatics/bty185. It is mainly a worst-case scenario, warning people that pruning could potentially go wrong. As stated in the paper, using pruning or clumping (on MAF) gives very similar results in the general case.


Thank you for the update, Florian.