Question: Is it possible to fetch 1000 genomes project v37 from Entrez?
5.0 years ago by
United States
bfeeny30 wrote:


Right now I am working with GRCh37.p13 RefSeq data from Entrez, querying it with efetch in Biopython. Here is a sample of the type of query I am doing (chromosome 1):

net_handle = Entrez.efetch(db="nucleotide",id="NC_000001.10",rettype="fasta", retmode="text")

This result is drastically different from the reference chromosome 1 from the 1000 Genomes Project, specifically what's contained in their file human_g1k_v37.fasta.gz which I obtained from

I am analyzing SNPs, and my understanding is that Ancestry, 23andMe and others typically use 1000 Genomes data as their reference, which is why I am looking to use it as well. If this is incorrect, and they in fact use another reference that you are aware of, please let me know.

What would be ideal is if I could query the data from human_g1k_v37 via a RefSeq or some other means using the Entrez service.  Does anyone know if this is possible?








entrez biopython 1000genomes ncbi • 1.4k views
modified 5.0 years ago by Devon Ryan94k • written 5.0 years ago by bfeeny30
5.0 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

I downloaded NC_000001.10 from NCBI and the fasta file you mentioned from 1000 genomes:

$ samtools faidx human_g1k_v37.fasta 1 | grep -v ">" | tr -d '\n' | md5sum
1b22b98cdeb4a9304cb5d48026a85128  -
$ cat NC_000001.10.fasta | grep -v ">" | tr -d '\n' | md5sum
1b22b98cdeb4a9304cb5d48026a85128  -

So they are, in fact, identical except for the chromosome names, as expected. Presuming you need to check a good number of sequences, just download the fasta file and query that. It'll be faster anyway.
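
For anyone who would rather run the same check from Python, here is a minimal sketch of the checksum comparison above using only the standard library. The helper name `fasta_record_md5` and its parameters are made up for illustration and are not part of Biopython or samtools. It assumes an uncompressed FASTA file and identifies a record by the first whitespace-delimited token of its header line.

```python
import hashlib


def fasta_record_md5(path, record_name=None):
    """Return the MD5 hex digest of one FASTA record's sequence.

    Header lines and line breaks are ignored, so the line width of the
    file does not affect the result. If record_name is None, the first
    record in the file is used; otherwise the record whose header's first
    whitespace-delimited token equals record_name is used.
    (Hypothetical helper for illustration, not a library function.)
    """
    digest = hashlib.md5()
    in_record = False
    found = False
    with open(path, "rb") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(b">"):
                if found:
                    break  # reached the next record; stop reading
                fields = line[1:].split()
                name = fields[0].decode() if fields else ""
                in_record = record_name is None or name == record_name
                found = in_record
            elif in_record:
                # Hash the sequence characters exactly as written (no case folding)
                digest.update(line)
    if not found:
        raise ValueError(f"record {record_name!r} not found in {path}")
    return digest.hexdigest()
```

With the two files from this thread, one would compare `fasta_record_md5("human_g1k_v37.fasta", "1")` against `fasta_record_md5("NC_000001.10.fasta")`. Like the shell pipeline, this hashes the bases as they appear in the file, so a difference in upper/lower case (soft-masking) would still show up as a mismatch.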

written 5.0 years ago by Devon Ryan94k

Devon, thank you for responding. I am rather new to working with all these tools and to bioinformatics in general. I did a very naive sdiff of the two files with a "stare and compare" approach, which led me astray: the files have different line widths, and sdiff was abbreviating some information. I really like the md5 pipeline you showed me for making sure my data is the same. It's great to know the data is identical, and it makes a lot of sense.

written 5.0 years ago by bfeeny30