How do I compute ld blocks from the hapmap ld_data?
2
1
Entering edit mode
2.7 years ago
endrebak ▴ 920

When people ask for ld blocks on this forum, this link is often posted:

ftp://ftp.ncbi.nlm.nih.gov/hapmap/ld_data/latest

However, those files merely contain the correlations between snps. How do I use these to compute ld-blocks?

Would be nice if you could post an example script or invocation of a tool, not merely post a link :)

Thank you!

ld snps • 2.1k views
6
Entering edit mode
2.7 years ago

The contents of those files are explained in the README; however, I'm not sure how well supported is that format specification (if it is even defined as a format specification).

I have an imputed 1000 Genomes Phase III dataset on my computer that I utilise for various projects —including LD analyses— when needed. If you follow this tutorial, you'll also have this imputed dataset: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Once you have all of those files, generating a LD heatmap for any region is as simple as this, taken from my notes:

## 1, extract variants overlapping the region of interest +/- 1Kbp from the 1000 Genomes Phase III data

bcftools view chr3.1kg.phase3.v5.vcf.gz -R Gene_hg19.bed | \
bcftools annotate -Ob -I +'%CHROM:%POS:%REF:%ALT' > Region.bcf ;
bcftools index Region.bcf ;


Here, the co-ordinates of the region of interest are stored in the BED file. I also set the VCF ID field to CHROM:POS:REF:ALT if it's not already set

## 2, extract different populations

bcftools view -Ob --samples-file EAS.list Region.bcf > EAS.bcf ;
bcftools index EAS.bcf ;


Here, EAS.list comprises a list of sample IDs representing the East Asian (EAS) super population. I identified these from 20130606_g1k.ped

## 3, convert the BCF file to PLINK format (requires PLINK >= 1.9)

plink --noweb --bcf EAS.bcf \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out EAS ;


## 4, export data to HaploView format

plink --noweb --bfile EAS \
--chr 3 --from-bp 135364584 --to-bp 135399507 \
--snps-only no-DI \
--recodeHV \
--out EAS_Haploview ;


## 5, determine haplotype blocks in HaploView

I have written the following notes:

Start HaploView. In the ‘Welcome to HaploView’ window, select the ‘LINKAGE Format’ tab. Then, click on the ‘browse’ button and select your PED file as ‘Data File’ - this is the PED file that you generated from PLINK.

Click ‘OK’.

Select the ‘LD Plot’ tab to generate a LD heatmap. To save the plot as a Portable Network Graphics (PNG) file, go to File > Export current tab to PNG

## 6, LD heatmap in R

Then, you can read that HaploView-exported file back into R to mess around with it:

require(png)

par(mar=c(0,0,0,0), mfrow=c(4,1))

plot(1:2, type='n', main="", xlab="", ylab="", xaxt="n", yaxt="n", axes=FALSE)
lim <- par()
rasterImage(haploblocks, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4]-0.2)

require(SDMTools)
require(RColorBrewer)
points <- cbind(x=c(1.925, 1.96, 1.96, 1.925), y=c(1.5, 1.5, 1.1, 1.1))


NB - the gene track I added also via R, and also via readPNG(). I previously downloaded it from UCSC and saved it as a separate PNG

Kevin

0
Entering edit mode

I thank you for your reply. I should have been more explicit; I wanted bed files with the ld-blocks to use in my analyses.

0
Entering edit mode

Then, first, you have to identify the LD blocks, which are calculated at the SNP level and not the base-pair level, per se

0
Entering edit mode

Ah, I had no idea. That makes sense :)

0
Entering edit mode

Thank you for this tutorial! It helps immensely in my work. May I ask what is imputation? My beginner understanding is that we use reference datasets like 1000 Genome to fill in missing genotypes in our data. So then how and why did you impute the 1000 genome data? Do we need to do it before all analysis?

1
Entering edit mode

Yes, that is the general procedure in imputation. We don't have to do it for the data that I used above, as it is already imputed against some other reference panel.

You don't technically have to do any imputation with your own data. We would do it if some of our SNPs failed and/or there was low SNP density / coverage across the region on which we wished to derive haplo-types and -blocks.

1
Entering edit mode

Thank you for clarifying! I am getting lots of insights from you.

2
Entering edit mode
2.7 years ago
endrebak ▴ 920
0
Entering edit mode

Great - thank you.