Question: Download only sequence information from NCBI
Asked 5.8 years ago by Xapple:

My problem is the following: I have a list of GI identifiers from the NCBI nucleotide database. Take just this one, for instance: `76365841`. I want to extract the "isolation source" term from it. The answer here is "Everglades wetlands", which you can see by fetching the record with `efetch`.

However, when I hit a full chromosome with a huge sequence, my program downloads the full sequence and Biopython's `Entrez.parse` is unable to handle it. For instance, with `332640072`.

Is there any way to build a request to NCBI that batch-downloads the sequence information (including the isolation source) WITHOUT downloading the actual sequence, i.e. the A, G, T, C bases?

If you want to see the program:

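    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address
    gis = ['332640072', '76365841', '22506766', '389043336']
    # Fetches the full GenBank XML records, sequences included
    response = Entrez.efetch(db="nucleotide", id=gis, retmode="xml")
    records = list(Entrez.parse(response, validate=True))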

Tags: xml, efetch, python, ncbi

Answer by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087), 5.8 years ago:

When you parse a large XML document, the idea is to skip some elements using a streaming parser, either StAX or SAX. Both should be available in Python.

For example, the following Java program would skip the content of the DNA sequence:


    curl -s ",332640072,76365841,389043336&retmode=xml" |\
        java  Biostar113766 | grep '<GBSeq_sequence>'
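
The same skip-while-streaming idea can be sketched in Python with the standard library's `xml.sax` module. This is only an illustration, not a port of the Java program: the class name `IsolationSourceHandler` and the function `isolation_sources` are hypothetical, and the element names assume the GBSeq XML layout that `efetch` returns with `retmode=xml`. Note that the sequence is still downloaded; the handler simply never stores its text, so memory use stays small.

```python
import io
import xml.sax

# Elements whose (small) text we keep. Everything else, including
# GBSeq_sequence, is never buffered.
CAPTURED = {"GBSeq_locus", "GBQualifier_name", "GBQualifier_value"}


class IsolationSourceHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects (locus, isolation_source) pairs."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._buffer = None      # list of text chunks, or None when skipping
        self._locus = None
        self._qual_name = None
        self._source = None

    def startElement(self, name, attrs):
        if name == "GBSeq":
            # New record: reset per-record state.
            self._locus = None
            self._source = None
        if name in CAPTURED:
            self._buffer = []

    def characters(self, content):
        # Called many times for a long sequence; dropped unless capturing.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name in CAPTURED:
            text = "".join(self._buffer).strip()
            self._buffer = None
            if name == "GBSeq_locus":
                self._locus = text
            elif name == "GBQualifier_name":
                self._qual_name = text
            elif name == "GBQualifier_value":
                # Keep the first isolation_source qualifier of the record.
                if self._qual_name == "isolation_source" and self._source is None:
                    self._source = text
        elif name == "GBQualifier":
            self._qual_name = None
        elif name == "GBSeq":
            self.results.append((self._locus, self._source))


def isolation_sources(stream):
    """Parse GBSeq XML from a binary file-like object."""
    handler = IsolationSourceHandler()
    xml.sax.parse(stream, handler)
    return handler.results


# Tiny made-up record, just to show the call.
sample = io.BytesIO(
    b"<GBSet><GBSeq><GBSeq_locus>DEMO1</GBSeq_locus>"
    b"<GBSeq_feature-table><GBFeature><GBFeature_quals><GBQualifier>"
    b"<GBQualifier_name>isolation_source</GBQualifier_name>"
    b"<GBQualifier_value>Everglades wetlands</GBQualifier_value>"
    b"</GBQualifier></GBFeature_quals></GBFeature></GBSeq_feature-table>"
    b"<GBSeq_sequence>acgtacgt</GBSeq_sequence></GBSeq></GBSet>"
)
print(isolation_sources(sample))
```

In practice the `stream` argument could be the handle returned by `Entrez.efetch`, so records are processed as they arrive instead of being loaded whole.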



Answer by 5heikki, 5.8 years ago:

Not all sequences have that information, but here is an example with Entrez Direct:


    epost -db nuccore -id 22506766,332640072,76365841,389043336 | efetch -format docsum | xtract -element SubName | tr "\t" "\n"
    PB131|Panama: Panama Province, Las Cumbres Lake|lake water at 5 m depth during dry season|9.0986 N 79.5392 W
    F124|USA: Florida|Everglades wetlands
    SFD1-19|USA: San Francisco Delta, Mildred Island 2000-07-20


I'm not so sure the SubName element is standard though.
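
Since `SubName` is a pipe-separated list of values with no labels, one way to pick out the isolation source reliably is to pair it with a parallel list of qualifier names. The sketch below assumes the docsum also carries such a field (here called `SubType`); that field, the helper name `pair_subsource`, and the example `SubType` string are all assumptions for illustration, so check them against a real docsum before relying on them.

```python
def pair_subsource(subtype, subname, sep="|"):
    """Zip qualifier names (SubType) with their values (SubName) into a dict."""
    if not subtype or not subname:
        return {}
    names = subtype.split(sep)
    values = subname.split(sep)
    # zip() stops at the shorter list, so a mismatch never raises.
    return dict(zip(names, values))


summary = pair_subsource(
    "clone|country|isolation_source",         # hypothetical SubType string
    "F124|USA: Florida|Everglades wetlands",  # SubName from the output above
)
print(summary.get("isolation_source"))
```

Looking the value up by name rather than by position avoids guessing which column holds the isolation source, since the number of fields differs between records.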


Thanks for the answer! But your solution still downloads the whole >10 MB sequence.

Reply by Xapple, 5.8 years ago

It most certainly does not download the whole sequence:

    epost -db nuccore -id 332640072 | efetch -format docsum > file
    du -h file
    4.0K    file

Reply by 5heikki, 5.8 years ago

I wanted to get only the meta-information.

Reply by Xapple, 5.8 years ago