I have also looked at the LDheatmap package in R. But it seems the calculation of LD is based on a SNP information from a study, i.e. one need to provide the genotype of each participants. The snpStats package in Bioconductor seems to serve the same purpose. So these two methods do not seem to fit my situation.
Is there any way I could extract the LD matrix (R2) for CEU population, based on HapMap 22 dataset or 1000G dataset? (Considering I have a lot of lists of SNPs)
If you'd like to use data from the 1000 Genomes project, it's possible to download VCF files and use PLINK 1.9, an extremely fast command line tool for working with genomic data. Briefly, the workflow would be something along the lines of:
Use --make-bed to create binary fileset from the VCF;
Create a file listing all the variants of interest;
Use PLINK's LD functions (namely --r2) on the binary fileset, providing your variant list as input to the --ld-snp-list argument.
PLINK primarily allows for calculation of the r2 statistic, but can can also return pairwise D or D' if specified. These can be returned in either table or matrix format. Also - I see that you're also looking for SNPs where r2 > 0.8. In this case, you can use the --ld-window-r2 argument in the --r2 command to filter out any SNP pairs with r2 below that value.
I downloaded the 1000 Genomes phase 3 VCF files from this FTP folder [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/].
I wonder if the one named " " covers chr 1-22? I suspect "wgs" means whole genome, but it seems to me the file size not that big, comparing to the separated files of chr1 to chr22.
I also try plink on my PC, I already put the plink file in the same folder where I store the VCF file.
Then I use this command line
Can I get the linkage across the entire chromosome in Ensembl? It seems that it can not display since "The region you have selected is too large to display linkage data; a maximum region of 75kb is allowed. Please change the region using the navigation controls above."
The web-page has a limit of 75kb and the REST-API endpoint. In principle, there is no size limit on getting LD using the Perl API, but in practice the load would be so much that your computer would just explode.
The limits match up to what we actually think will be in LD. Beyond that, variants are so unlikely to be in LD then there's no point in pretending to calculate it.
Thank you. It is good to know this. But in my case, I have a list of SNPs (may be from different chromosomes), and I want to obtain the LD between them (those in the same chromosomes). So I think I have to obtain this from a comprehensive source, since I have too many SNPs, I do not want to manually search it on any website.
How big is your list? Do you just want to compare them to each other? You could use the Perl or REST APIs to get pairwise LD for each (perhaps with a chromosome filter) and print to a table of your own design.
If I have only two lists, each with less than 10 SNPs, I can manually search on Ensemble. But I have more than a hundred lists, and each list may have different number of SNPs. I have two purposes: 1) To generate the LD matrix for each list of SNPs; 2) To identify SNPs with R2>= 0.8 within each list.
You could loop through the list, or when you have a list of SNPs, better just to download the 1000 genome region, and calculate LD locally. This solution is good when we have small number of SNPs.
I downloaded the 1000 Genomes phase 3 VCF files from this FTP folder [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/]. I wonder if the one named " " covers chr 1-22? I suspect "wgs" means whole genome, but it seems to me the file size not that big, comparing to the separated files of chr1 to chr22.
I also try plink on my PC, I already put the plink file in the same folder where I store the VCF file. Then I use this command line
plink --vcf ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --make-bed --out binary_fileset
It returns error: Failed to open ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
Any idea what is wrong?