Question: 1000 Genomes Api
1
gravatar for User 1933
5.9 years ago by
User 1933340
User 1933340 wrote:

So basically, I wonder if there is any API for working with 1000 genome data - to makes it easier to parse the VCF file. For example, I would like to see how many people have variants in gene "CHAT", and then generate the corresponding VCF file, and running variant effect predictor on that vcf file.

I am fine to make my own pipeline to some extend, but I could not discover how to get the VCF file with individual information for a given gene automatically.

vcf 1000genomes parsing • 2.5k views
ADD COMMENTlink modified 5.9 years ago by Pierre Lindenbaum123k • written 5.9 years ago by User 1933340
4
gravatar for Pierre Lindenbaum
5.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

a one-liner:

using the ucsc mysql server, get the coordinate of the gene "CHAT", build the 100G URL and the region, and xargs to tabix:

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N -e  'select concat("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.",K.chrom,".phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"),concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)+1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="CHAT" '   | xargs ./tabix-0.2.5/tabix -h

output:

##fileformat=VCFv4.1
(...)
##reference=GRCh37
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    HG00096    HG00097    HG00099    HG00100    HG
10    50817173    rs186028903    C    T    100    PASS    ERATE=0.0005;AA=C;THETA=0.0032;RSQ=0.6818;L
10    50817210    rs80200823    G    C    100    PASS    AC=258;AA=G;AN=2184;LDAF=0.1214;VT=SNP;ERATE
10    50817321    rs190746084    G    T    100    PASS    AVGPOST=0.9963;THETA=0.0015;RSQ=0.3962;AA=G
10    50817546    rs139389571    C    A    100    PASS    LDAF=0.0052;AA=C;AVGPOST=0.9966;RSQ=0.7248;
10    50817665    rs183262975    C    A    100    PASS    THETA=0.0039;RSQ=0.6013;AA=C;AN=2184;VT=SNP
10    50817698    rs150036809    G    A    100    PASS    LDAF=0.0120;AA=G;AN=2184;RSQ=0.4837;VT=SNP;
10    50817735    rs8178981    C    T    100    PASS    AA=C;AN=2184;VT=SNP;THETA=0.0006;AC=14;RSQ=0.
10    50817751    rs12774253    C    T    100    PASS    AC=268;AA=C;RSQ=0.9685;AN=2184;THETA=0.0008;
10    50817803    rs191436474    G    T    100    PASS    ERATE=0.0005;THETA=0.0002;AA=G;AN=2184;LDAF
10    50817960    rs182461778    G    A    100    PASS    THETA=0.0002;AA=G;AN=2184;AC=20;RSQ=0.3961;
10    50818011    rs885835    C    T    100    PASS    AA=C;AN=2184;ERATE=0.0025;AVGPOST=0.9357;VT=SN
 (...)
ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Pierre Lindenbaum123k

can you make some comment what "concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)-1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="x" is ?! thanks :)

ADD REPLYlink written 5.9 years ago by User 1933340

K.chrom,LENGTH(chrom)-3) : remove the "chr' prefix MIN(K.txStart)-1: left/min part of the area. Hum that should be +1 , not -1.. I'll fix it. ,MAX(K.txEnd): right part of the area.

see Genomic cordinates from UCSC

ADD REPLYlink written 5.9 years ago by Pierre Lindenbaum123k

How about also adding a specific region ?! say from 50817173 to 50817735 ! I appreciate your help

ADD REPLYlink written 5.9 years ago by User 1933340
2
(. where K.chrom="chr1" and NOT (K.txStart>50817735 or K.txEnd < 50817173 )
ADD REPLYlink written 5.9 years ago by Pierre Lindenbaum123k

Thanks a lot ! It was working nicely - but now I am facing

[ti_index_load] wrong magic number. [ti_index_load] fail to load the index: ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi [tabix] failed to load the index file.

What could be the reason ?!

ADD REPLYlink written 5.9 years ago by User 1933340

I feel stupid to say, but this piece does not work :( !

ADD REPLYlink written 5.9 years ago by User 1933340

hi, my friend... is the SNP archive like "phase1_release_v3.20101123.snps" often updated? If it is often updated, where can I find the information to update the information like "phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" in your one-liner?

ADD REPLYlink written 5.9 years ago by osakuraei0

I took the first VCF in the 1000G project I found. It should work for any tabix-indexed VCF on the web.

ADD REPLYlink written 5.9 years ago by Pierre Lindenbaum123k

hi, my friend, I found "The 1000 genomes snp and short indel all get submitted to dbSNP....." in this link "http://www.1000genomes.org/category/dbsnp". Is there any one-liner or programmatically method to get the similiar result from dbSNP?

If I have got the elements for our own SNP as below:

  1. Flank DNA;
  2. The gene name.

How can we retrieve data related with our own SNP from dbSNP programmatically?

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by osakuraei0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1857 users visited in the last hour