Question

Linkage disequilibrium decay sanity check

0

Entering edit mode

7.4 years ago

biogirl ▴ 210

Hi guys,

I'm looking at linkage disequilibrium using r2 statistic from Plink, and just basically want to run by two things:

1) A sanity check for what I've run in plink 2) Looking at R code to plot r2 over distance

For plink, I used an ld-window-kb size of 1 (so 1000 bp) and removed the default ld-window-r2 and set it to 0. Also included the --r2 flag (of course!). I've seen many different approaches to calculating r2 in plink now, and just want a sanity check that this is ok.

Secondly, which is the best why to visualize LD decay over distance? I've seen plots that are just the absolute distance and r2 (which is very busy), and others that take an average in a given window. Which is best?

Thanks all

snp plink linkage disequilibrium • 4.0k views

ADD COMMENT • link updated 7.2 years ago by willgilks ▴ 360 • written 7.4 years ago by biogirl ▴ 210

score 2 · Answer 1 · 2017-02-03

Hi biogirl,

I've found plink simple to use, and very fast for calculating r2. If you want to see where the r2 levels-out, so there's no point in going up to 1Mb. In fruit flies, the r2 seems to level-out after about 200bp although results have been published at much further distances. That having been said, places like the HLA region have long-range LD, so if you're studying that ... go 1MB. You can try on different settings of course. Averaging LD across genomes might miss interesting effects, so something like an LD-manhattan plot might be interesting.

I've posted some of my code using plink for calculating r2, and R for plotting the results. Ideally, the bash script would do the calculations in plink, and then automatically plot the correct graph with R, dependent on user command line input.

Calculate r2 across all samples and regions, 10Kb window, maf > 0.1.

mkdir results
plink1.9 --allow-extra-chr --allow-no-sex \
--bfile ${gname} \
--ld-window-kb 10 \
--r2 \
--maf 0.1 \ ## Because it runs faster, and low-frequency variant statistics 'can' be inaccurate.
    --out results/${gname}_ld

Start the R script. Safe 'vanilla' settings, log using tee.

R --vanilla < plot_popgen.R | tee mylog_plot_popgen.txt

R code for formatting and plotting LD decay with distance separated by chromosome.

Packages:

 library(tidyr) ## re-arranging data frames.
 library(dplyr) ## fast and simple coding.
 library(car) ## recoding character variables.
 library(ggplot2) ## fancy plots.

Custom theme for plots because the native ggplot ones are not perfect:

 theme_ld = function() {theme_bw(base_size = 10) +
     theme(panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_rect(color = "grey", size = 0.5),
      axis.line.x = element_line(size = .5, colour = "black"), 
      axis.line.y = element_line(size = .5, colour = "black"), 
      strip.background = element_rect(colour = "white", fill = "white"))}
 attr(theme_ld(), "complete")

Read in ye olde data. Takes a minute or so to load 24 million rows with 8G ram.

 ld = read.table("my_data.ld", header = TRUE)

 CHR_A  BP_A    SNP_A CHR_B  BP_B    SNP_B       R2
1     2 28813 hc_hnmve     2 34345 hc_iwmqn 0.604879
2     2 28813 hc_hnmve     2 35460 hc_tslfx 0.910921
3     2 28813 hc_hnmve     2 36330 hc_bgaeq 0.932208
4     2 28813 hc_hnmve     2 36631 hc_ohnky 0.955198
5     2 34345 hc_iwmqn     2 35460 hc_tslfx 0.600348

Format ye olde data using dplyr ("mutate" means add or change a column).

 lddat = ld %>%
   sample_frac(.1) %>% ## Take a small, random subset because big data is big.
   mutate(kb = (BP_B - BP_A) / 1000 ) %>% ## Get the distance between each marker pair in kilobases.
   select(CHR_A, BP_A, BP_B, kb, R2) %>% ## Select only the useful columns.
   filter(CHR_A != 6) %>% ## Remove uninteresting chromosomes. 
   mutate(CHR_A = recode(CHR_A, "1 = 'chrX'; 2 = 'chr2L'; 3 = 'chr2R'; 4 = 'chr3L'; 5 = 'chr3R'")) %>% ## Recode chromosome names zzz.
   mutate(mean_pos = ((BP_A + BP_B) /2) / 1e+06) %>%
   rename(chromosome = CHR_A) %>%
   arrange(chromosome, BP_A) %>% ## Sort.
   mutate(logr2 = log(R2)) ## Transform r2 to make it comparable with data published by another group.

And the amazing reformatted data looks like:

  chromosome   BP_A   BP_B    kb       R2  mean_pos       logr2
1         23 184371 184917 0.546 0.352941 0.1846440 -1.04145437
2         23 184917 192656 7.739 0.339545 0.1887865 -1.08014879
3         23 186859 191605 4.746 0.313607 0.1892320 -1.15961467
4         23 188456 191605 3.149 0.942754 0.1900305 -0.05894990
5         23 192656 194397 1.741 0.962043 0.1935265 -0.03869613
6         23 264446 268467 4.021 0.961862 0.2664565 -0.03888429

Plot ye olde data as a scatter plot with overlaid moving average, using ggplot. 'Select' just reduces the input data size. Colour (and group) by chromosome.

 ld_genome =
   ggplot(lddat %>% select(chromosome, mean_pos, R2), aes(mean_pos, R2, colour = "chromosome")) + ## Select data, specify variables.
     geom_point(size = 0.1, alpha = .1) +
     stat_smooth(se = TRUE, colour = "blue") +
     theme_ld()