Question: 1000 Genomes Ld Calculation
25
gravatar for Ryan D
8.9 years ago by
Ryan D3.3k
USA
Ryan D3.3k wrote:

Having no genotype data and only meta-analysis data to proceed on, we want to find all SNPs in linkage disequilibrium (LD) with a significant result from compiled GWAS. So if we have a dbSNP rs#, sweet tools like SNAP http://www.broadinstitute.org/mpg/snap/ldsearch.php can nail down all SNPs in Hapmap 2/3 that have some LD with such a SNP. Does a comparable tool or relatively simple means exist for doing this on 1000G data? Or am I in for a trip to the dentist?

genome linkage gwas • 30k views
ADD COMMENTlink modified 5.5 years ago by Leandro Lima930 • written 8.9 years ago by Ryan D3.3k

Interesting. I'm interested in if LD have been already computed by someone for these genomes at all? Would it be computationally feasible?

ADD REPLYlink written 8.8 years ago by Michael Dondrup46k

I don't see that happening until the dataset is frozen, until no more individuals will be sequenced and added. Yes, I think it would be feasible.

ADD REPLYlink written 8.8 years ago by Larry_Parnell16k

Thanks everyone. I signed up for the 1000G seminar at ASHG as well. I'm curious, Larry, what sort of format do you load the data into Haploview or HelixTree? We have the phased haplotypes. The Asian (CHB+JPT) dataset we're interested in as of the June 2010 freeze has 62 individuals in it so far. My colleague also indicates that he has downloaded ped files for PLINK from the MACH website. I have not tried it yet, but the link is here: http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-2010-06.html

ADD REPLYlink written 8.8 years ago by Ryan D3.3k

Well, I wrote what we >would< do - we don't do this yet. And it is my colleagues you would be ones to run those data. I'll have to ask. Check back later...

ADD REPLYlink written 8.8 years ago by Larry_Parnell16k

Some data exists on this LD. I ended up pulling data from files with code like this: zcat ~/1000Genomes/2010-06/CEU/LD/xt/chr19.xt.gz | grep -w rs11671664 | awk '$4 > 0.5'

rs11670375 rs11671664 0.8961 0.5673 A,G
chr19:50848886 rs11671664 0.8990 0.7340 C,G
rs11083777 rs11671664 0.8990 0.7340 G,G
chr19:50851826 rs11671664 0.8899 0.7134 C,G
chr19:50852809 rs11671664 0.8990 0.7340 A,G
chr19:50853145 rs11671664 0.8990 0.7340 G,G
rs4375772 rs11671664 0.8899 0.7134 C,G
rs11671664 chr19:50865055 1.0000 0.6138 G,G
ADD REPLYlink modified 6.5 years ago by brentp23k • written 8.8 years ago by Ryan D3.3k

Hi Ryan, where did you find the LD data on the 1000 genome site? Could you put the link to the file? Would help me big time. Thanks

ADD REPLYlink written 8.8 years ago by Michael Dondrup46k
12
gravatar for Ryan D
7.6 years ago by
Ryan D3.3k
USA
Ryan D3.3k wrote:

This has been answered elsewhere by now, but thanks to all who gave input: It requires downloading tabix and vcftools. Use tabix to download the relevant region from 1kG like so:

tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Now use vcftools to output the variants into PLINK format. TPED seems to work best for some versions.

./vcftools --vcf ~/genotypes.vcf --plink-tped --out plinkformat

Now use PLINK to convert to bed/bim/fam filetypes and run whatever LD tests you wish on those SNPs, such as:

plink --tped plinkformat.tped --tfam plinkformat.tfam --make-bed --out ~/delete
plink --bfile delete --ld rs961253 rs2423279
plink --bfile delete --r2 --ld-snp-list list.txt --ld-window-kb 1000 --ld-window 99999 --ld-window-r2 0 --out ld_results

Thanks to Stephen of Getting Genetics Done, and Michael for his comments below.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Ryan D3.3k
2

Hi Michael, if you are getting that error, try this with vcftools instead: ./vcftools --vcf ~/genotypes.vcf --plink-tped --out plinkformat plink --tped plinkformat.tped --tfam plinkformat.tfam --make-bed --out ~/delete

As for a specific pair of SNPs: plink --bfile delete --ld rs961253 rs2423279

ADD REPLYlink written 7.6 years ago by Ryan D3.3k
2

With PLINK 1.9, vcftools step can be dropped, as plink can read vcf files as input. So it would be just 2 lines tabix to get the vcf file and then plink to get ld: plink --vcf genotypes.vcf --ld rs34680782 rs711752

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by zx87547.9k

Thanks for pointing this out with plink 1.9. I suppose it wasn't asked in the OPs question, but a way to get the classic "LD triangle" is to select a region of vcf with tabix and then run plink 1.9 with "--r2 triangle" (or "--r2 square", then you can just plot a heatmap). 1000 genomes data also produces a pretty "sparse" corrleations with all snps included too, so maybe add --maf 0.01 or similar to filter very rare snps (--write-snplist will then tell you the list of variants that were used)

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by cmdcolin1.2k
1

cool, thanks a ton. +2 (unfortunately not possible)

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k
1

Ryan D. This was very useful. I made it into a simple script:

<script src="&lt;a href=" brentp="" 5050522"="">brentp/5050522"></script>
If anyone uses it, let me know any modifications.

ADD REPLYlink written 6.5 years ago by brentp23k
1

Fixed link http://gist.github.com/brentp/5050522

ADD REPLYlink written 3.0 years ago by cmdcolin1.2k

Hi brentp, I tried to run your linkage-blocks.py script but I get the following error message: File "linkage-blocks.py", line 40 print "chrom\tstart\tend\tblock_size" ^ SyntaxError: invalid syntax

ADD REPLYlink written 5.4 years ago by 71490

I'm using python < 3. Looks like you're using python > 3. I'm pretty sure cpv won't work on python3, so you'd have to try python 2.7.

ADD REPLYlink written 5.4 years ago by brentp23k

Hi Ryan, do know if that also will give me the r-squared values for snp pairs?

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k

When trying this, I get "ERROR: No file [ plinkformat.fam ] exists" in plink. Some step missing there?

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k

I tried this using a bed file of GWAS snps but I got the following error "ti_index_core] the indexes overlap or are out of bounds". I used hg19_gwas_snps.bed and hg19 1kg data ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/

ADD REPLYlink written 7.5 years ago by Gireesh K. Bogu100

I tried 1st tabix command using a bed file of GWAS snps but I got the following error "ti_index_core] the indexes overlap or are out of bounds". I used hg19_gwas_snps.bed and hg19 1kg data ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/

ADD REPLYlink written 7.5 years ago by Gireesh K. Bogu100

I've recently heard that as passing data from vcf to plink, phase information is lost... Is it true?

ADD REPLYlink written 6.9 years ago by Peixe580
1

VCFtools outputs alleles in the order that they are found in the VCF file, but PLINK doesn't preserve phase information. The phase will be lost the the order is preserved in the file. So in theory it could be recovered.

ADD REPLYlink written 6.9 years ago by Ryan D3.3k

So in order to keep it, should the "--keep-allele-order" option, isn't it?

ADD REPLYlink written 6.9 years ago by Peixe580

That looks right: http://www.biostars.org/post/show/7866/phased-data-in-plink/

ADD REPLYlink written 6.9 years ago by Ryan D3.3k
8
gravatar for Leandro Lima
5.5 years ago by
Leandro Lima930
San Francisco, CA
Leandro Lima930 wrote:

Just an update…

The LD data is available here https://statgen.sph.umich.edu/locuszoom/download/

ADD COMMENTlink written 5.5 years ago by Leandro Lima930
1

Thanks Leandro, great resource!

ADD REPLYlink written 5.0 years ago by Khader Shameer18k
6
gravatar for Larry_Parnell
8.8 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

Here is what we would do:

The minimum number of individuals that you need to calculate LD with confidence is 50 and they must be unrelated. Using data from 100 individuals would be much better, also unrelated. Our tool of choice is HelixTree from GoldenHelix. My colleague says that Haploview will also work. The input data are the variants; you don't need the invariant sequence. We would use a length of sequence from 200 to 500 kbp on either side of the region/gene of interest. Yes this is big, but until you calculate LD, you don't know how far it stretches from the variant(s) of interest.

ADD COMMENTlink written 8.8 years ago by Larry_Parnell16k

of course if you plan to study a particular region programs like HelixTree or Haploview will do, but if you want to have LD measures across the entire human genome there is no program yet (none that I am aware of) that can handle the large variation (~14M variants) covered by this project. but I definitely see that focusing in particular regions is, as a workaround, an intelligent approach.

ADD REPLYlink written 8.8 years ago by Jorge Amigo11k
5
gravatar for Mary
8.8 years ago by
Mary11k
Boston MA area
Mary11k wrote:

Hey folks--I just wanted to add a couple of tidbits from the 1000G tutorial session at ASHG.

  1. The whole session was videotaped. It will eventually be available on genome.gov and 1000G site. I will keep an eye out for that and let you know. Their slide presentations will also be available.

  2. In Paul Flicek's session he showed LD data in the browser on the 1000G site, but his slide says that it is "Currently based on data from HapMap and Perlegen populations" and "Populations selectable from drop down tab". I haven't had a chance to look for this yet and I have my crappy road computer so I won't even bother right now.

  3. Someone specifically asked about LD on this data, and Paul answered that no LD tools exist right now for this. So I would say if you are going to wrestle with this it will definitely be toothache time.

PS: someone talked about having downloaded the variation files (?) on the day the paper was released, and that it seemed to be a subset. But they said in the session that a new file had gone up at 2pm the afternoon of that session and it was MUCH bigger--so if you haven't looked recently you might want to check out the files again. There were also supposedly changes to the browser that day as well.

ADD COMMENTlink written 8.8 years ago by Mary11k
5
gravatar for Brian Shirts
7.8 years ago by
Brian Shirts50
Brian Shirts50 wrote:

If you look up a single SNP in the Ensembl genome browser, there is a linkage disequilibrium tab on the left. 3 of the populations listed are from 1000 genomes pilot data. I am not sure what the inner workings are so if anyone can clarify what Ensembl is posting that would be great, but the few LD calculations I have checked appear correct. Even for SNPs not in Hapmap. Ensembl is the best tool for this I have seen so far.

ADD COMMENTlink written 7.8 years ago by Brian Shirts50
4
gravatar for Michael Dondrup
6.1 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Here is a Makefile.in which I used to generate LD data from 1kG phase1 data. It can do everything from downloading the vcf files to calculation LD using Intersnp or Plink. Some perl scripts and binaries are missing, but that should serve as an example at least. One can use make -j 8 to run the processes in parallel, but make download should have run first.

#PWD=pwd
#DIR=`basename $(PWD)`
mirror=ftp://ftp.1000genomes.ebi.ac.uk/vol1
ftpurl1=$(mirror)/ftp/release/20110521/ALL.chr
ftpurl2=.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
#tabix -fh $ftpurl > genotypes.vcf
POPULATION=EUR

intersnp : intersnpoutputhapfile.txt

plink : plinkoutput.ld

download : genotypes.vcf


panel :
        grep -e$(POPULATION) ../phase1_integrated_calls.20101123.ALL.panel | cut -f1 > panel

genotypes.vcf :
        wget -c $(ftpurl1)`basename $(CURDIR)`$(ftpurl2) -O genotypes.vcf.gz
        gunzip genotypes.vcf.gz

genotypes.subset.vcf : genotypes.vcf panel
        vcf-cut.pl -c panel genotypes.vcf > genotypes.subset.vcf 

plinkformat.tfam plinkformat.tped : genotypes.subset.vcf
        vcftools --vcf genotypes.subset.vcf --plink-tped --out plinkformat

plinkBEDformat.bed plinkBEDformat.fam : plinkformat.tfam 
        plink --tfile plinkformat --make-bed --out plinkBEDformat --noweb --maf 0.002 --hwe 0.001
        mv plinkBEDformat.fam plinkBEDformat.fam.tmp
        cat plinkBEDformat.fam.tmp | sed 's/-9/1/g' > plinkBEDformat.fam

intersnpoutputhapfile.txt : plinkBEDformat.bed
        time intersnp sfile.txt

hapmapFormat.txt : intersnpoutputhapfile.txt
        reformatIntersnpVcf.pl -v genotypes.subset.vcf -p EUR intersnpoutputhapfile.txt > hapmapFormat.txt

plinkoutput.ld : plinkBEDformat.bed
        time plink --bfile plinkBEDformat  --ld-window-kb 1000 --ld-window 99999  --out plinkoutput --noweb 

clean :
        -$(RM) panel plinkformat.* plinkBEDformat.* genotypes.subset.vcf*  intersnpoutput* plinkoutput*

realclean : clean
        -$(RM) genotypes.vcf*

Here is the root Makefile, run make setup to create subdirectories:

SUBDIRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y


setup: 
        for i in $(SUBDIRS); \
        do \
                echo ========= preparing $$i ; \
                (mkdir -p $$i ; cp -v ./Makefile.in $$i/Makefile ; cp -v ./sfile.txt $$i/sfile.txt ) ; \
        done

$(SUBDIRS)::
        $(MAKE) -C $@ $(MAKECMDGOALS)

all clean intersnp hapmapFormat.txt download plink realclean: $(SUBDIRS)

#.DEFAULT: 
#       for i in $(SUBDIRS); \
#        do \
#                echo ========= Making in $$i ; \
#                (cd $$i ; $(MAKE) -$(MAKEFLAGS) $@ ) ; \
#        done
ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Michael Dondrup46k
3
gravatar for Jorge Amigo
8.9 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

as far as I know, there are no tools yet working on 1000 Genomes data that deal with LD. the official browser allows only browsing the data, but the kind of information we are used to by HapMap is not yet present, although I do not doubt that they will soon provide it.

we were in a similar situation months ago, and we had in fact to adapt our own tool for browsing population statistics SPSmart in order to accept data from 1000 Genomes, and therefore to extract allele frequencies and Fst values from their pilot datasets. I guess you should start preparing your teeth...

ADD COMMENTlink written 8.9 years ago by Jorge Amigo11k
3
gravatar for Floris Brenk
5.6 years ago by
Floris Brenk890
USA
Floris Brenk890 wrote:

Hello, I'm liking all the methods mentioned here, but still it is not as user friendly for non bioinformaticians as the SNAP tool. I'm still looking for some program/script where I can just paste a list of SNPs (500+) and get all the european population SNPs r2 >0.8 as output... Hope someone can help me out!

ADD COMMENTlink written 5.6 years ago by Floris Brenk890
2

If you have PLINK-format 1000g files, you can use

plink --bfile [1000g fileset prefix] --keep [list of EUR samples] --r2 --inter-chr --ld-snp-list [file with list of SNP IDs] --ld-window-r2 0.8

(If you do not have them, they will be posted online within a few months.)

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by chrchang5235.5k
3

hmmm ok so this would do the trick you reckon?

vcftools --vcf chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf --out TRY_OUT.plink --plink

> VCFtools - v0.1.7 (C) Adam Auton 2009
> 
> Parameters as interpreted:     --vcf
> chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.EUR.vcf
>     --out TRY_OUT.plink     --plink
> 
> VCF index is older than VCF file. Will regenerate. Building new index
> file.     Scanning Chromosome: 22     Warning - file contains entries with
> the same position. This is not supported by vcftools, and may cause
> unexpected behaviour.
> 
> Writing Index file. File contains 232005 entries and 379 individuals.
> Applying Required Filters. After filtering, kept 379 out of 379
> Individuals After filtering, kept 232005 out of a possible 232005
> Sites Writing PLINK PED file ...  Writing PLINK MAP file ... Done. Run
> Time = 28.00 seconds

plink --file TRY_OUT.plink --r2 --inter-chr --ld-snp-list chr22snp.txt --ld-window-r2 0.8 --out TRY_chr22_LD_r08 --noweb

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ TRY_chr22_LD_r08.log ]
Analysis started: Thu Jan 16 15:19:43 2014

Options in effect:
    --file TRY_OUT.plink
    --r2
    --inter-chr
    --ld-snp-list chr22snp.txt
    --ld-window-r2 0.8
    --out TRY_chr22_LD_r08
    --noweb

232005 (of 232005) markers to be included from [ TRY_OUT.plink.map ]
Warning, found 379 individuals with ambiguous sex codes
These individuals will be set to missing ( or use --allow-no-sex )
Writing list of these individuals to [ TRY_chr22_LD_r08.nosex ]
379 individuals read from [ TRY_OUT.plink.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 379 missing
0 males, 0 females, and 379 of unspecified sex
Before frequency and genotyping pruning, there are 232005 SNPs
379 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 232005 SNPs
After filtering, 0 cases, 0 controls and 379 missing
After filtering, 0 males, 0 females, and 379 of unspecified sex
Writing LD statistics to [ TRY_chr22_LD_r08.ld ] 

Analysis finished: Thu Jan 16 15:21:03 2014
ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Floris Brenk890
2
gravatar for Mary
8.8 years ago by
Mary11k
Boston MA area
Mary11k wrote:

FYI: if anyone is going to ASHG there's a tutorial on 1000 Genomes data usage. Daniel MacArthur gets the hat tip for this:

RT @dgmacarthur: Attending #ASHG2010 and interested in using data from the 1000 Genomes Project? Sign up here: http://bit.ly/9X29xC

I've registered for it if anyone wants to introduce themselves there.

ADD COMMENTlink written 8.8 years ago by Mary11k
1

Hopefully, I will be @ ASHG too. I've registered but bureaucratic snafus may not actually allow me to go

ADD REPLYlink written 8.8 years ago by Larry_Parnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2003 users visited in the last hour