Question

1000 Genomes Api

1

Entering edit mode

11.7 years ago

User 1933 ▴ 360

So basically, I wonder if there is any API for working with 1000 genome data - to makes it easier to parse the VCF file. For example, I would like to see how many people have variants in gene "CHAT", and then generate the corresponding VCF file, and running variant effect predictor on that vcf file.

I am fine to make my own pipeline to some extend, but I could not discover how to get the VCF file with individual information for a given gene automatically.

1000genomes vcf parsing • 4.3k views

ADD COMMENT • link updated 11.7 years ago by Pierre Lindenbaum 166k • written 11.7 years ago by User 1933 ▴ 360

score 4 · Answer 1 · 2013-11-08

4

Entering edit mode

11.7 years ago

Pierre Lindenbaum 166k

a one-liner:

using the ucsc mysql server, get the coordinate of the gene "CHAT", build the 100G URL and the region, and xargs to tabix:

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N -e  'select concat("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.",K.chrom,".phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"),concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)+1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="CHAT" '   | xargs ./tabix-0.2.5/tabix -h

output:

##fileformat=VCFv4.1
(...)
##reference=GRCh37
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    HG00096    HG00097    HG00099    HG00100    HG
10    50817173    rs186028903    C    T    100    PASS    ERATE=0.0005;AA=C;THETA=0.0032;RSQ=0.6818;L
10    50817210    rs80200823    G    C    100    PASS    AC=258;AA=G;AN=2184;LDAF=0.1214;VT=SNP;ERATE
10    50817321    rs190746084    G    T    100    PASS    AVGPOST=0.9963;THETA=0.0015;RSQ=0.3962;AA=G
10    50817546    rs139389571    C    A    100    PASS    LDAF=0.0052;AA=C;AVGPOST=0.9966;RSQ=0.7248;
10    50817665    rs183262975    C    A    100    PASS    THETA=0.0039;RSQ=0.6013;AA=C;AN=2184;VT=SNP
10    50817698    rs150036809    G    A    100    PASS    LDAF=0.0120;AA=G;AN=2184;RSQ=0.4837;VT=SNP;
10    50817735    rs8178981    C    T    100    PASS    AA=C;AN=2184;VT=SNP;THETA=0.0006;AC=14;RSQ=0.
10    50817751    rs12774253    C    T    100    PASS    AC=268;AA=C;RSQ=0.9685;AN=2184;THETA=0.0008;
10    50817803    rs191436474    G    T    100    PASS    ERATE=0.0005;THETA=0.0002;AA=G;AN=2184;LDAF
10    50817960    rs182461778    G    A    100    PASS    THETA=0.0002;AA=G;AN=2184;AC=20;RSQ=0.3961;
10    50818011    rs885835    C    T    100    PASS    AA=C;AN=2184;ERATE=0.0025;AVGPOST=0.9357;VT=SN
 (...)

ADD COMMENT • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

can you make some comment what "concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)-1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="x" is ?! thanks :)

ADD REPLY • link 11.7 years ago by User 1933 ▴ 360

0

Entering edit mode

K.chrom,LENGTH(chrom)-3) : remove the "chr' prefix MIN(K.txStart)-1: left/min part of the area. Hum that should be +1 , not -1.. I'll fix it. ,MAX(K.txEnd): right part of the area.

see Genomic cordinates from UCSC

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

How about also adding a specific region ?! say from 50817173 to 50817735 ! I appreciate your help

ADD REPLY • link 11.7 years ago by User 1933 ▴ 360

2

Entering edit mode

(. where K.chrom="chr1" and NOT (K.txStart>50817735 or K.txEnd < 50817173 )

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks a lot ! It was working nicely - but now I am facing

[ti_index_load] wrong magic number. [ti_index_load] fail to load the index: ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi [tabix] failed to load the index file.

What could be the reason ?!

ADD REPLY • link 11.7 years ago by User 1933 ▴ 360

0

Entering edit mode

I feel stupid to say, but this piece does not work :( !

ADD REPLY • link 11.7 years ago by User 1933 ▴ 360

0

Entering edit mode

hi, my friend... is the SNP archive like "phase1_release_v3.20101123.snps" often updated? If it is often updated, where can I find the information to update the information like "phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" in your one-liner?

ADD REPLY • link 11.7 years ago by osakuraei • 0

0

Entering edit mode

I took the first VCF in the 1000G project I found. It should work for any tabix-indexed VCF on the web.

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

hi, my friend, I found "The 1000 genomes snp and short indel all get submitted to dbSNP....." in this link "http://www.1000genomes.org/category/dbsnp". Is there any one-liner or programmatically method to get the similiar result from dbSNP?

If I have got the elements for our own SNP as below:

Flank DNA;
The gene name.

How can we retrieve data related with our own SNP from dbSNP programmatically?

ADD REPLY • link 11.7 years ago by osakuraei • 0