Question: LD-decay in a r2 vs distance(cm) plot
gravatar for adele
6 months ago by
adele10 wrote:

I would like to create a LD vs distance(cm) plot in R using an output from PLINK. I have been trying with different --ld-window-kb numbers but I can not plot them in R. Furthermore, I have a question about average r2. In some papers ld_decay is plotted average r2 vs distance and some papers used r2 vs distance. Can everybody explain more?
thanks for your attention

snp R • 1.5k views
ADD COMMENTlink modified 6 months ago by hewm200810 • written 6 months ago by adele10

Could you give an example of those paper?

ADD REPLYlink written 6 months ago by GabrielMontenegro400
gravatar for rmf
6 months ago by
rmf340 wrote:

I've struggled with this too. This was my approach. Suggestions to improve this is welcome.

When you ask plink for LD r2 using this command,

plink --bfile "snp" --r2 --out "snp"

behind the scenes, --ld-window-r2 is set to 0.2 which means snps with r2 value below 0.2 are ignored. Also pairwise comparisons are limited to 1000 kb windows --ld-window-kb. Both of these are unsuitable for LD decay plot. We want pairwise comparisons for all r2 values and pairwise comparisons across all snps across the whole chromosome.

But you mostly likely have too many SNPs to do that. So first thing, thin it down evenly. I used mapthin. It works with the plink bim files.

mapthin -b 20 snp.bim snp-thin.bim

Total number of SNPs in original file: 18731824
Number of SNPs in thinned file: 26952 (0.143883%)
Mean base pair position (in file) between SNPs: 50185.6 bpp

This means keep 20 snps every 10^6 bases and create a new bim. The genome length of my organism is 1 bil bases. If your org has even larger genome, you might want to lower this number. Now that you have a thinned set, you can run plink to compute r2 values between all pairs. Set --ld-window-kb to the length of the longest chromosome in kb. I am not sure what this is --ld-window, but I am told to set it to some stupidly large number.

plink --bfile "snp-thin" --r2 --ld-window-r2 0 --ld-window 999999 --ld-window-kb 8000 --out "snp-thin"

This exports the file snp-thin.ld with all pairwise comparisons and r2 values. You should get something like this

 CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
     1          359         Var-1-359      1        10377       Var-1-10377            1
     1          359         Var-1-359      1        30391       Var-1-30391            1
     1          359         Var-1-359      1        40409       Var-1-40409            1

This could be a fairly big file depending on how many thinned snps you had. It was around 5GB for me. All we want is the distance between each pair and the r2 value. We do that to create a summary file.

cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > snp-thin.ld.summary

Summary file looks like

1       0.157895
1       0.489362
1       1
1       1
1       1

Now plotting in R. Read the table into R, add some column names.


dfr <- read.delim("snp-thin.ld.summary",sep="",header=F,check.names=F,stringsAsFactors=F)
colnames(dfr) <- c("dist","rsq")

This data is going to be quite noisy. I found it best to categorise distances into intervals of a fixed length and compute mean r2 within each interval. Below, I group into 10 Kb intervals.

dfr$distc <- cut(dfr$dist,breaks=seq(from=min(dfr$dist)-1,to=max(dfr$dist)+1,by=100000))

Then compute mean and/or median r2 within the blocks

dfr1 <- dfr %>% group_by(distc) %>% summarise(mean=mean(rsq),median=median(rsq))

A helper step to get mid points of our distance intervals for plotting.

dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),

Now, plot.

  labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+

This should produce something like this:

enter image description here

You could also do this on a smaller scale by limiting pairwise comparisons to smaller window size thereby getting more finer on the lower end. Perhaps this is what you want. Otherwise, you look at this and think this is not as interesting as you thought it would be. Then, what you really want to see is the LD block length distribution. That continues as a new answer as I am out of space.

ADD COMMENTlink modified 6 months ago • written 6 months ago by rmf340

Thanks for your comment. That was helpful.

ADD REPLYlink modified 4 months ago • written 4 months ago by HG20
gravatar for rmf
6 months ago by
rmf340 wrote:

For LD block length distribution, I ran this on the thinned SNP set.

plink --bfile "snp-thin"  --blocks --blocks-max-kb 200 --out "snp-thin"

This produces snp-thin.blocks.det. Then some R.

dfr <- read.delim("snp-thin.blocks.det",sep="",header=T,check.names=F,stringsAsFactors=F)
colnames(dfr) <- tolower(colnames(dfr))

# ld block density
p <- ggplot(dfr,aes(x=kb))+
  labs(x="LD block length (Kb)",y="Density")+


enter image description here

Now you see LD block length distribution for your set of samples.

ADD COMMENTlink written 6 months ago by rmf340

I had so much fun learning with you. Your answers were very insightful and interactive, so thank you for sharing your wisdom with me.

ADD REPLYlink written 6 months ago by adele10
gravatar for hewm2008
6 months ago by
hewm200810 wrote:

I thinks this software(PopLDdecay) can help you to draw LDdecay Figure directly,

The simplest usage

./bin/PopLDdecay    -InVCF  SNP.vcf.gz  -OutStat LDdecay   [#other para:   -SubPop    GroupA_sample.list]
  perl  bin/  -inFile   LDdecay.stat.gz  -output  Fig

Use Data of this web site to test follow software ( ) only with two based site in chr22 (minimal SNP database) of the 1000 Genomes Project. ALL the pair-wise SNP R^2 is the same.

| 板本号        |     平均内存      |  核心程序计算时间       | 输入格式转化和结果统计时间        | 输出结果|
| version       |    Average memory |  Core calculation CPUs  | Format conver&Statistics Time CPU| result size|

|Plink 1.07     |    1.4G           |     680min              | 5 min+45min     |         54G
|Plink 2.0      |    18.807G        |      25min              | 5 min+45min     |         54G
|Haploview 4.2  |    95.760G        |     3904min             | 5 min+45min     |         54G
|PopLDdcay 3.30 |    2.5G           |     200min              |  0min           |         4.1M

if all chr (1-22) were calculated, it will take too much hard disk space to stored the temporary result.

ADD COMMENTlink modified 6 months ago • written 6 months ago by hewm200810
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 921 users visited in the last hour