Question: LD-decay in a r2 vs distance(cm) plot
2
gravatar for adele
16 months ago by
adele30
adele30 wrote:

I would like to create a LD vs distance(cm) plot in R using an output from PLINK. I have been trying with different --ld-window-kb numbers but I can not plot them in R. Furthermore, I have a question about average r2. In some papers ld_decay is plotted average r2 vs distance and some papers used r2 vs distance. Can everybody explain more?
thanks for your attention

snp R • 4.7k views
ADD COMMENTlink modified 15 months ago by hewm200820 • written 16 months ago by adele30

Could you give an example of those paper?

ADD REPLYlink written 16 months ago by GabrielMontenegro520
15
gravatar for rmf
16 months ago by
rmf650
rmf650 wrote:

I've struggled with this too. This was my approach. Suggestions to improve this is welcome.

When you ask plink for LD r2 using this command,

plink --bfile "snp" --r2 --out "snp"

behind the scenes, --ld-window-r2 is set to 0.2 which means snps with r2 value below 0.2 are ignored. Also pairwise comparisons are limited to 1000 kb windows --ld-window-kb. Both of these are unsuitable for LD decay plot. We want pairwise comparisons for all r2 values and pairwise comparisons across all snps across the whole chromosome.

But you mostly likely have too many SNPs to do that. So first thing, thin it down evenly. I used mapthin. It works with the plink bim files.

mapthin -b 20 snp.bim snp-thin.bim

Total number of SNPs in original file: 18731824
Number of SNPs in thinned file: 26952 (0.143883%)
Mean base pair position (in file) between SNPs: 50185.6 bpp

This means keep 20 snps every 10^6 bases and create a new bim. The genome length of my organism is 1 bil bases. If your org has even larger genome, you might want to lower this number. Now that you have a thinned set, you can run plink to compute r2 values between all pairs. Set --ld-window-kb to the length of the longest chromosome in kb. I am not sure what this is --ld-window, but I am told to set it to some stupidly large number.

plink --bfile "snp-thin" --r2 --ld-window-r2 0 --ld-window 999999 --ld-window-kb 8000 --out "snp-thin"

This exports the file snp-thin.ld with all pairwise comparisons and r2 values. You should get something like this

 CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
     1          359         Var-1-359      1        10377       Var-1-10377            1
     1          359         Var-1-359      1        30391       Var-1-30391            1
     1          359         Var-1-359      1        40409       Var-1-40409            1
...

This could be a fairly big file depending on how many thinned snps you had. It was around 5GB for me. All we want is the distance between each pair and the r2 value. We do that to create a summary file.

cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > snp-thin.ld.summary

Summary file looks like

1       0.157895
1       0.489362
1       1
1       1
1       1
...

Now plotting in R. Read the table into R, add some column names.

library(dplyr)
library(stringr)
library(ggplot2)

dfr <- read.delim("snp-thin.ld.summary",sep="",header=F,check.names=F,stringsAsFactors=F)
colnames(dfr) <- c("dist","rsq")

This data is going to be quite noisy. I found it best to categorise distances into intervals of a fixed length and compute mean r2 within each interval. Below, I group into 10 Kb intervals.

dfr$distc <- cut(dfr$dist,breaks=seq(from=min(dfr$dist)-1,to=max(dfr$dist)+1,by=100000))

Then compute mean and/or median r2 within the blocks

dfr1 <- dfr %>% group_by(distc) %>% summarise(mean=mean(rsq),median=median(rsq))

A helper step to get mid points of our distance intervals for plotting.

dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
                        end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
                        mid=start+((end-start)/2))

Now, plot.

  ggplot()+
  geom_point(data=dfr1,aes(x=start,y=mean),size=0.4,colour="grey20")+
  geom_line(data=dfr1,aes(x=start,y=mean),size=0.3,alpha=0.5,colour="grey40")+
  labs(x="Distance (Megabases)",y=expression(LD~(r^{2})))+
  scale_x_continuous(breaks=c(0,2*10^6,4*10^6,6*10^6,8*10^6),labels=c("0","2","4","6","8"))+
  theme_bw()

This should produce something like this:

enter image description here

You could also do this on a smaller scale by limiting pairwise comparisons to smaller window size thereby getting more finer on the lower end. Perhaps this is what you want. Otherwise, you look at this and think this is not as interesting as you thought it would be. Then, what you really want to see is the LD block length distribution. That continues as a new answer as I am out of space.

ADD COMMENTlink modified 16 months ago • written 16 months ago by rmf650
1

For fast LD computation directly from a VCF file, see here.

ADD REPLYlink written 7 months ago by rmf650

Thanks for your comment. That was helpful.

ADD REPLYlink modified 13 months ago • written 13 months ago by HG20

Thanks for sharing that! It helped me a lot! Let me a question please? 1) You created mid column in the command:

dfr1 <- dfr1 %>% mutate(start=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"^[0-9-e+.]+")),
                        end=as.integer(str_extract(str_replace_all(distc,"[\\(\\)\\[\\]]",""),"[0-9-e+.]+$")),
                        mid=start+((end-start)/2))

But it was not used during the plotting. Am i right? If so, why we need this column?

Thanks, Denis

ADD REPLYlink modified 7 months ago • written 7 months ago by Denis70
1

You are right. I don't use it here. The mid variable is just used to find the mid point of the interval windows. When plotting, you could use mid instead of start. That was my plan, but then it turns out, it doesn't make much difference for this particular plot.

ADD REPLYlink written 7 months ago by rmf650

Hi, This post is very helpful. I had a doubt with the following code:

cat snp-thin.ld | sed 1,1d | awk -F " " 'function abs(v) {return v < 0 ? -v : v}BEGIN{OFS="\t"}{print abs($5-$2),$7}' | sort -k1,1n > snp-thin.ld.summary

I was wondering whether the "cat" is a function in plink/mapthin or it refers to separate software? Thanks in advance. Manju

ADD REPLYlink written 5 months ago by manjusst0

Hi! cat is linux shell command. Please check: http://www.linfo.org/cat.html

ADD REPLYlink written 5 months ago by Denis70

Thnaks for sharing this tutorial. I have a question about mapthin: I used .bim files to thin my data, but when I run PLINK to do LD, I get an error message ( Invalid .bed file size (expected 22158 bytes)). That means my .bed file is steal unchanged (contains all genotypes).

My question is : How to extract genotypes for SNPs contained in thinned bim file from bed file? to get a bed file with genotypes for thinned markers.

Thanks, Abdel

ADD REPLYlink written 4 months ago by Abdel20

This post is great!

I am getting into the same issue, did you find a solution? Would you mind to share?

Thanks,

Beatriz

ADD REPLYlink written 12 weeks ago by beanavarro850

I had the same issue as you, and I think I figured out a way to re-integrate the mapthin output.

Assuming that, as described above, you ran mapthin as "mapthin -b 20 snp.bim snp-thin.bim"

You can then run the following to reformat the sites in your original bfile according to the mapthin output: plink --bfile snp --extract snp-thin.bim --recode --make-bed --out snp-thin2

You'll then get updated files and should be able to use --bfile snp-thin2 for plink analyses.

I hope this helps!

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by rando220
5
gravatar for rmf
16 months ago by
rmf650
rmf650 wrote:

For LD block length distribution, I ran this on the thinned SNP set.

plink --bfile "snp-thin"  --blocks --blocks-max-kb 200 --out "snp-thin"

This produces snp-thin.blocks.det. Then some R.

dfr <- read.delim("snp-thin.blocks.det",sep="",header=T,check.names=F,stringsAsFactors=F)
colnames(dfr) <- tolower(colnames(dfr))

# ld block density
p <- ggplot(dfr,aes(x=kb))+
  geom_density(size=0.5,colour="grey40")+
  labs(x="LD block length (Kb)",y="Density")+
  theme_bw()

ggsave("snp-thin-ld-blocks.png",p,height=8,width=8,units="cm",dpi=250)

enter image description here

Now you see LD block length distribution for your set of samples.

ADD COMMENTlink written 16 months ago by rmf650
1

I had so much fun learning with you. Your answers were very insightful and interactive, so thank you for sharing your wisdom with me.

ADD REPLYlink written 16 months ago by adele30
2
gravatar for hewm2008
15 months ago by
hewm200820
hewm200820 wrote:

I thinks this software(PopLDdecay) can help you to draw LDdecay Figure directly, https://github.com/BGI-shenzhen/PopLDdecay

The simplest usage

./bin/PopLDdecay    -InVCF  SNP.vcf.gz  -OutStat LDdecay   [#other para:   -SubPop    GroupA_sample.list]
  perl  bin/Plot_OnePop.pl  -inFile   LDdecay.stat.gz  -output  Fig

Use Data of this web site to test follow software ( ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ ) only with two based site in chr22 (minimal SNP database) of the 1000 Genomes Project. ALL the pair-wise SNP R^2 is the same.

| 板本号        |     平均内存      |  核心程序计算时间       | 输入格式转化和结果统计时间        | 输出结果|
| version       |    Average memory |  Core calculation CPUs  | Format conver&Statistics Time CPU| result size|

|Plink 1.07     |    1.4G           |     680min              | 5 min+45min     |         54G
|Plink 2.0      |    18.807G        |      25min              | 5 min+45min     |         54G
|Haploview 4.2  |    95.760G        |     3904min             | 5 min+45min     |         54G
|PopLDdcay 3.30 |    2.5G           |     200min              |  0min           |         4.1M

if all chr (1-22) were calculated, it will take too much hard disk space to stored the temporary result.

ADD COMMENTlink modified 15 months ago • written 15 months ago by hewm200820

Dear Sir, I sent you an email (hewm2008@gmail.com,hewm2008@qq.com) with an issue on PopLDdecay. Could you please have a look? Manju

ADD REPLYlink written 5 months ago by manjusst0

I have already sent back the mail. any question are welcome.

ADD REPLYlink written 5 months ago by hewm200820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1920 users visited in the last hour