Having no genotype data and only meta-analysis data to proceed on, we want to find all SNPs in linkage disequilibrium (LD) with a significant result from compiled GWAS. So if we have a dbSNP rs#, sweet tools like SNAP http://www.broadinstitute.org/mpg/snap/ldsearch.php can nail down all SNPs in Hapmap 2/3 that have some LD with such a SNP. Does a comparable tool or relatively simple means exist for doing this on 1000G data? Or am I in for a trip to the dentist?
tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf
Now use vcftools to output the variants into PLINK format. TPED seems to work best for some versions.
./vcftools --vcf ~/genotypes.vcf --plink-tped --out plinkformat
Now use PLINK to convert to bed/bim/fam filetypes and run whatever LD tests you wish on those SNPs, such as:
plink --tped plinkformat.tped --tfam plinkformat.tfam --make-bed --out delete
plink --bfile delete --ld rs961253 rs2423279
plink --bfile delete --r2 --ld-snp-list list.txt --ld-window-kb 1000 --ld-window 99999 --ld-window-r2 0 --out ld_results
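If you only need the pairs above some r2 threshold from the last command's output, a little awk over the .ld table does it. A minimal sketch — the two data rows below are made-up numbers, not real output, but the columns match what plink's --r2 report looks like:

```shell
# Build a tiny mock of plink's --r2 output (columns: CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2).
cat > ld_results.ld <<'EOF'
 CHR_A  BP_A  SNP_A     CHR_B  BP_B  SNP_B      R2
 20     60479 rs961253  20     61000 rs2423279  0.912
 20     60479 rs961253  20     62000 rs1112223  0.401
EOF

# Keep only pairs with R2 > 0.8; print the partner SNP and its r2.
awk 'NR > 1 && $7 > 0.8 { print $6, $7 }' ld_results.ld
# prints: rs2423279 0.912
```

Note that plink can also do the thresholding itself via --ld-window-r2 0.8 instead of 0.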
Thanks to Stephen of Getting Genetics Done, and Michael for his comments below.
Here is what we would do:
The minimum number of individuals you need to calculate LD with confidence is 50, and they must be unrelated; data from 100 unrelated individuals would be much better. Our tool of choice is HelixTree from Golden Helix; my colleague says that Haploview will also work. The input data are the variants; you don't need the invariant sequence. We would use 200 to 500 kbp of sequence on either side of the region/gene of interest. Yes, this is big, but until you calculate LD you don't know how far it stretches from the variant(s) of interest.
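For reference, the r2 that any of these tools reports is just D squared normalised by the allele frequencies: r2 = D^2 / (pA(1-pA) pB(1-pB)), with D = pAB - pA*pB. A quick sanity check with invented frequencies (pAB, pA, pB here are made-up numbers, not from any dataset):

```shell
# Compute r2 from hypothetical two-locus frequencies.
awk 'BEGIN {
  pAB = 0.40; pA = 0.50; pB = 0.60            # hypothetical haplotype/allele frequencies
  D  = pAB - pA * pB                          # D = 0.10
  r2 = (D * D) / (pA * (1 - pA) * pB * (1 - pB))
  printf "%.4f\n", r2                         # prints 0.1667
}'
```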
Hey folks--I just wanted to add a couple of tidbits from the 1000G tutorial session at ASHG.
The whole session was videotaped. It will eventually be available on genome.gov and 1000G site. I will keep an eye out for that and let you know. Their slide presentations will also be available.
In Paul Flicek's session he showed LD data in the browser on the 1000G site, but his slide says that it is "Currently based on data from HapMap and Perlegen populations" and "Populations selectable from drop down tab". I haven't had a chance to look for this yet and I have my crappy road computer so I won't even bother right now.
Someone specifically asked about LD on this data, and Paul answered that no LD tools exist right now for this. So I would say if you are going to wrestle with this it will definitely be toothache time.
PS: someone talked about having downloaded the variation files (?) on the day the paper was released, and that it seemed to be a subset. But they said in the session that a new file had gone up at 2pm the afternoon of that session and it was MUCH bigger--so if you haven't looked recently you might want to check out the files again. There were also supposedly changes to the browser that day as well.
If you look up a single SNP in the Ensembl genome browser, there is a linkage disequilibrium tab on the left. Three of the populations listed are from 1000 Genomes pilot data. I am not sure what the inner workings are, so if anyone can clarify what Ensembl is posting that would be great, but the few LD calculations I have checked appear correct, even for SNPs not in HapMap. Ensembl is the best tool for this I have seen so far.
Here is a Makefile.in which I used to generate LD data from 1kG phase 1 data. It can do everything from downloading the VCF files to calculating LD using INTERSNP or PLINK. Some Perl scripts and binaries are missing, but it should serve as an example at least. One can use
make -j 8
to run the processes in parallel, but make download should be run first.
#PWD=pwd
#DIR=`basename $(PWD)`
mirror=ftp://ftp.1000genomes.ebi.ac.uk/vol1
ftpurl1=$(mirror)/ftp/release/20110521/ALL.chr
ftpurl2=.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
#tabix -fh $ftpurl > genotypes.vcf
POPULATION=EUR

intersnp : intersnpoutputhapfile.txt

plink : plinkoutput.ld

download : genotypes.vcf

panel :
	grep -e$(POPULATION) ../phase1_integrated_calls.20101123.ALL.panel | cut -f1 > panel

genotypes.vcf :
	wget -c $(ftpurl1)`basename $(CURDIR)`$(ftpurl2) -O genotypes.vcf.gz
	gunzip genotypes.vcf.gz

genotypes.subset.vcf : genotypes.vcf panel
	vcf-cut.pl -c panel genotypes.vcf > genotypes.subset.vcf

plinkformat.tfam plinkformat.tped : genotypes.subset.vcf
	vcftools --vcf genotypes.subset.vcf --plink-tped --out plinkformat

plinkBEDformat.bed plinkBEDformat.fam : plinkformat.tfam
	plink --tfile plinkformat --make-bed --out plinkBEDformat --noweb --maf 0.002 --hwe 0.001
	mv plinkBEDformat.fam plinkBEDformat.fam.tmp
	cat plinkBEDformat.fam.tmp | sed 's/-9/1/g' > plinkBEDformat.fam

intersnpoutputhapfile.txt : plinkBEDformat.bed
	time intersnp sfile.txt

hapmapFormat.txt : intersnpoutputhapfile.txt
	reformatIntersnpVcf.pl -v genotypes.subset.vcf -p EUR intersnpoutputhapfile.txt > hapmapFormat.txt

plinkoutput.ld : plinkBEDformat.bed
	time plink --bfile plinkBEDformat --r2 --ld-window-kb 1000 --ld-window 99999 --out plinkoutput --noweb

clean :
	-$(RM) panel plinkformat.* plinkBEDformat.* genotypes.subset.vcf* intersnpoutput* plinkoutput*

realclean : clean
	-$(RM) genotypes.vcf*
Here is the root Makefile, run make setup to create subdirectories:
SUBDIRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

setup:
	for i in $(SUBDIRS); \
	do \
		echo ========= preparing $$i ; \
		(mkdir -p $$i ; cp -v ./Makefile.in $$i/Makefile ; cp -v ./sfile.txt $$i/sfile.txt ) ; \
	done

$(SUBDIRS)::
	$(MAKE) -C $@ $(MAKECMDGOALS)

all clean intersnp hapmapFormat.txt download plink realclean: $(SUBDIRS)

#.DEFAULT:
#	for i in $(SUBDIRS); \
#	do \
#		echo ========= Making in $$i ; \
#		(cd $$i ; $(MAKE) -$(MAKEFLAGS) $@ ) ; \
#	done
As far as I know, there are no tools yet that handle LD on 1000 Genomes data. The official browser only allows browsing the data; the kind of information we are used to from HapMap is not yet present, although I do not doubt they will provide it soon.
We were in a similar situation a few months ago, and in fact had to adapt our own population-statistics browser, SPSmart, to accept data from 1000 Genomes so that we could extract allele frequencies and Fst values from the pilot datasets. I guess you should start preparing your teeth...
Hello, I like all the methods mentioned here, but they are still not as user-friendly for non-bioinformaticians as the SNAP tool. I'm still looking for a program/script where I can just paste a list of SNPs (500+) and get back all SNPs at r2 > 0.8 in the European populations as output... Hope someone can help me out!
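Not a polished tool, but until one exists, a plink run like the one in the accepted answer plus a little awk gets close: feed plink the pasted list via --ld-snp-list (with --ld-window-r2 0.8 plink even does the thresholding itself), then join the output back against the list. The joining/filtering step, sketched on mock data — the file names and all numbers in the .ld file below are invented for illustration:

```shell
# The pasted SNP list (one rs# per line).
cat > snplist.txt <<'EOF'
rs961253
EOF

# Mock of plink's .ld output; in practice it would come from something like:
#   plink --bfile data --r2 --ld-snp-list snplist.txt --ld-window-r2 0 --out ld_results
cat > ld_results.ld <<'EOF'
 CHR_A  BP_A  SNP_A     CHR_B  BP_B  SNP_B      R2
 20     60479 rs961253  20     61000 rs2423279  0.912
 20     60479 rs961253  20     62000 rs1112223  0.401
EOF

# For each SNP in the list, report its partners with R2 > 0.8.
awk 'NR == FNR { want[$1]; next }
     FNR > 1 && ($3 in want) && $7 > 0.8 { print $3, $6, $7 }' snplist.txt ld_results.ld
# prints: rs961253 rs2423279 0.912
```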
FYI: if anyone is going to ASHG there's a tutorial on 1000 Genomes data usage. Daniel MacArthur gets the hat tip for this:
RT @dgmacarthur: Attending #ASHG2010 and interested in using data from the 1000 Genomes Project? Sign up here: http://bit.ly/9X29xC
I've registered for it if anyone wants to introduce themselves there.