Question

Download only sequence information from NCBI

9.6 years ago

Xapple ▴ 30

My problem is the following: I have a list of GI identifiers from the NCBI nucleotide database. Take this one, for instance: `76365841`. I want to extract its "isolation source" term. The answer here is "Everglades wetlands", which you can see by fetching the record with efetch:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=76365841&rettype=gb

However, when I hit a full chromosome with a huge sequence, my program downloads the full sequence and Biopython's `Entrez.parse` is unable to handle it. For instance, with `332640072`:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=332640072&rettype=gb

Is there any way to build a request to NCBI that batch downloads the sequence information (including isolation source) WITHOUT downloading the actual sequence, i.e. the A, G, T, C letters?

If you want to see the program:

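    # Python
    from Bio import Entrez

    # GI numbers to look up; 332640072 is the full chromosome that causes trouble
    gis = ['332640072', '76365841', '22506766', '389043336']
    # Fetch all records as XML (this includes the full sequences)
    response = Entrez.efetch(db="nucleotide", id=gis, retmode="xml")
    # Parse the XML into a list of records, validating it against the DTD
    records = list(Entrez.parse(response, validate=True))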

ncbi python xml efetch • 3.5k views

Updated 2.3 years ago by Ram 43k • written 9.6 years ago by Xapple ▴ 30

Ram · Answer 1 · 2014-09-25

When you parse a large XML document, the idea is to skip some elements using a streaming parser: either a StAX parser (http://en.wikipedia.org/wiki/StAX) or a SAX parser (http://en.wikipedia.org/wiki/Simple_API_for_XML). Both should be available in Python.

For example, the following Java program would skip the content of the DNA sequence:

    javac Biostar113766.java
    curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=22506766,332640072,76365841,389043336&retmode=xml" |\
        java Biostar113766 | grep '<GBSeq_sequence>'
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
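
Since the question is about Python, here is a minimal sketch of the same streaming idea with the standard library's `xml.sax`. It is not a port of the Java program above; the class name `IsolationSourceHandler` and the function `isolation_sources` are made up for illustration, and the element names (`GBSeq`, `GBSeq_primary-accession`, `GBQualifier_name`, `GBQualifier_value`) are assumed from the GenBank XML layout that efetch returns with `retmode=xml`. Text is buffered only inside the few elements of interest, so the content of `GBSeq_sequence` is never accumulated in memory.

```python
import io
import xml.sax


class IsolationSourceHandler(xml.sax.ContentHandler):
    """Collect (accession, isolation_source) per GBSeq, ignoring sequence text."""

    # Only text inside these elements is buffered; everything else is dropped.
    KEEP = {"GBSeq_primary-accession", "GBQualifier_name", "GBQualifier_value"}

    def __init__(self):
        super().__init__()
        self.results = []       # one (accession, isolation_source or None) per record
        self._buf = None        # active text buffer, or None when skipping text
        self._acc = None
        self._qual_name = None
        self._source = None

    def startElement(self, name, attrs):
        if name == "GBSeq":
            # New record: reset per-record state.
            self._acc = self._qual_name = self._source = None
        elif name in self.KEEP:
            self._buf = []

    def characters(self, content):
        # Called in chunks; GBSeq_sequence chunks arrive here with no buffer set.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name in self.KEEP:
            text = "".join(self._buf).strip()
            self._buf = None
            if name == "GBSeq_primary-accession":
                self._acc = text
            elif name == "GBQualifier_name":
                self._qual_name = text
            elif self._qual_name == "isolation_source" and self._source is None:
                self._source = text
        elif name == "GBQualifier":
            self._qual_name = None
        elif name == "GBSeq":
            self.results.append((self._acc, self._source))


def isolation_sources(stream):
    """Parse GenBank XML from a binary file-like object in a single streaming pass."""
    handler = IsolationSourceHandler()
    xml.sax.parse(stream, handler)
    return handler.results
```

The handle returned by `Entrez.efetch(db="nucleotide", id=gis, retmode="xml")` could presumably be passed straight to `isolation_sources`. Note that, like the filter above, this only avoids keeping the sequence in memory; the full record is still transferred over the network.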

Ram · Answer 2 · 2014-09-25

9.6 years ago

5heikki 11k

Not all sequences have that information, but for example with Entrez Direct:

    epost -db nuccore -id 22506766,332640072,76365841,389043336 | efetch -format docsum | xtract -element SubName | tr "\t" "\n"
    3
    PB131|Panama: Panama Province, Las Cumbres Lake|lake water at 5 m depth during dry season|9.0986 N 79.5392 W
    F124|USA: Florida|Everglades wetlands
    SFD1-19|USA: San Francisco Delta, Mildred Island 2000-07-20

I'm not so sure the SubName element is standard though.
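
One way to make the pipe-separated `SubName` values less ambiguous is to pair them with field names. The sketch below assumes that the docsum also carries a `SubType` element listing the qualifier names in the same order and with the same `|` separator; that is not shown in this thread, so treat it as an assumption to check against a real docsum. The function name `pair_subtype_subname` and the `SubType` string in the usage lines are hypothetical.

```python
def pair_subtype_subname(subtype, subname, sep="|"):
    """Pair qualifier names with values from two pipe-separated docsum strings.

    Returns an empty dict when the two strings do not have the same number
    of fields, since the pairing would then be unreliable.
    """
    keys = subtype.split(sep)
    values = subname.split(sep)
    if len(keys) != len(values):
        return {}
    return dict(zip(keys, values))


# Hypothetical SubType string, paired with a SubName value from the output above.
fields = pair_subtype_subname(
    "strain|country|isolation_source",
    "F124|USA: Florida|Everglades wetlands",
)
isolation_source = fields.get("isolation_source")
```

Records without an isolation source would simply yield `None` from `fields.get("isolation_source")`.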

Updated 2.3 years ago by Ram 43k • written 9.6 years ago by 5heikki 11k

Thanks for the answer! But your solution still downloads the whole >10 MB sequence.

9.6 years ago by Xapple ▴ 30

It most certainly does not download the whole sequence:

    epost -db nuccore -id 332640072 | efetch -format docsum > file
    du -h file
    4.0K    file
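
For the Python program in the question, a comparable sequence-free request can probably be made against the E-utilities `esummary.fcgi` endpoint, which as far as I know is what document summaries come from. The sketch below only builds the URL; the helper name `esummary_url` is made up, and `version=2.0` is an assumption about which docsum flavour contains the `SubName` field.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esummary_url(ids, db="nuccore", version="2.0"):
    """Build an esummary URL for a batch of IDs (summaries only, no sequence)."""
    query = {"db": db, "id": ",".join(ids), "version": version}
    # Keep commas literal so the ID list stays readable in the URL.
    return EUTILS_BASE + "esummary.fcgi?" + urlencode(query, safe=",")


url = esummary_url(["332640072", "76365841"])
```

The resulting URL can be opened with `urllib.request.urlopen` or pasted into a browser to inspect the summary XML. Biopython's `Entrez.esummary` should build an equivalent request.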