Question: Download only sequence information from NCBI
5.4 years ago by
Xapple30
Sweden
Xapple30 wrote:

My problem is the following: I have a list of GI identifiers from the NCBI nucleotide database. Take just this one, for instance: `76365841`. I want to extract its "isolation source" term. The answer here is "Everglades wetlands", which you can see by fetching the record with efetch:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=76365841&rettype=gb

However, when I hit a full chromosome with a huge sequence, my program downloads the full sequence and the Biopython Entrez parser is unable to handle it. For instance, with `332640072`:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=332640072&rettype=gb

Is there any way of building a request to NCBI that batch-downloads the sequence information (including the isolation source) WITHOUT downloading the actual sequence, i.e. the A, G, T, C letters?

If you want to see the program:

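    #python
    from Bio import Entrez

    # NCBI asks for a contact address with every E-utilities request (placeholder).
    Entrez.email = "you@example.org"
    gis = ['332640072', '76365841', '22506766', '389043336']
    # Fetches the full GenBank XML records, sequence included.
    response = Entrez.efetch(db="nucleotide", id=gis, retmode="xml")
    records = list(Entrez.parse(response, validate=True))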
5.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum126k wrote:

When you parse a large XML document, the idea is to skip some elements using either a StAX parser (http://en.wikipedia.org/wiki/StAX) or a SAX parser (http://en.wikipedia.org/wiki/Simple_API_for_XML). Both should be available in Python.

For example, a Java program (Biostar113766.java) would skip the content of the DNA sequence. Compiled and run as follows, it emits empty sequence elements:

    javac Biostar113766.java
    curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=22506766,332640072,76365841,389043336&retmode=xml" |\
            java  Biostar113766 | grep '<GBSeq_sequence>'
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
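
Since the question uses Python, the same idea can be sketched with the standard library's `xml.sax` module. This is only an illustration, not a port of the Java program: the handler name `QualifierHandler` and the helper `parse_qualifiers` are made up here, and the element names other than `GBSeq_sequence` (`GBSeq`, `GBQualifier`, `GBQualifier_name`, `GBQualifier_value`) are assumed from the GBSeq XML layout. The handler buffers text only for qualifier elements, so the sequence text is never accumulated in memory. Note that the sequence is still downloaded; it is just not kept.

```python
import io
import xml.sax


class QualifierHandler(xml.sax.ContentHandler):
    """Collect qualifier name/value pairs per record, ignoring sequence text."""

    WANTED = ("GBQualifier_name", "GBQualifier_value")

    def __init__(self):
        super().__init__()
        self.records = []   # one dict of qualifiers per GBSeq record
        self._path = []     # stack of currently open element names
        self._text = []     # character chunks of the current qualifier element
        self._name = None   # most recent qualifier name seen

    def startElement(self, name, attrs):
        self._path.append(name)
        if name == "GBSeq":
            self.records.append({})
        if name in self.WANTED:
            self._text = []

    def characters(self, content):
        # Text inside GBSeq_sequence (or any other element) is simply dropped.
        if self._path and self._path[-1] in self.WANTED:
            self._text.append(content)

    def endElement(self, name):
        if name == "GBQualifier_name":
            self._name = "".join(self._text)
        elif name == "GBQualifier_value" and self._name is not None:
            # Keep the first value seen for each qualifier name.
            self.records[-1].setdefault(self._name, "".join(self._text))
        elif name == "GBQualifier":
            self._name = None
        self._path.pop()


def parse_qualifiers(stream):
    """Stream-parse GBSeq XML from a file-like object; return a list of dicts."""
    handler = QualifierHandler()
    xml.sax.parse(stream, handler)
    return handler.records


# A tiny hand-written stand-in for an efetch response (not real NCBI output).
sample = b"""<GBSet><GBSeq>
  <GBSeq_feature-table><GBFeature><GBFeature_quals>
    <GBQualifier>
      <GBQualifier_name>isolation_source</GBQualifier_name>
      <GBQualifier_value>Everglades wetlands</GBQualifier_value>
    </GBQualifier>
  </GBFeature_quals></GBFeature></GBSeq_feature-table>
  <GBSeq_sequence>acgtacgt</GBSeq_sequence>
</GBSeq></GBSet>"""
records = parse_qualifiers(io.BytesIO(sample))
```

In real use the `stream` argument would be the open HTTP response rather than an in-memory buffer, so the parser consumes the download chunk by chunk.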

5.4 years ago by
5heikki8.6k
Finland
5heikki8.6k wrote:

Not all sequences have that information, but where it exists you can retrieve it with, for example, Entrez Direct:

    epost -db nuccore -id 22506766,332640072,76365841,389043336 | efetch -format docsum | xtract -element SubName | tr "\t" "\n"
    3
    PB131|Panama: Panama Province, Las Cumbres Lake|lake water at 5 m depth during dry season|9.0986 N 79.5392 W
    F124|USA: Florida|Everglades wetlands
    SFD1-19|USA: San Francisco Delta, Mildred Island 2000-07-20

I'm not so sure the SubName element is standard though.
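
The pipe-delimited SubName values above do not say which field is the isolation source. As far as I can tell, the document summary also carries a matching pipe-delimited `SubType` element with the field labels, so the two can be zipped together. A minimal Python sketch under that assumption follows; the helper `pair_subsource` is hypothetical, and the `SubType` string below is an assumed example rather than output copied from NCBI.

```python
def pair_subsource(subtype, subname):
    """Pair pipe-delimited SubType labels with the matching SubName values."""
    labels = subtype.split("|")
    values = subname.split("|")
    # zip stops at the shorter list, so a mismatched record loses trailing fields.
    return dict(zip(labels, values))


# SubName comes from the output above; the SubType labels are an assumption.
fields = pair_subsource(
    "clone|country|isolation_source",
    "F124|USA: Florida|Everglades wetlands",
)
isolation_source = fields.get("isolation_source")  # None when the label is absent
```

Using `.get` rather than indexing reflects the caveat above: records without an isolation source simply yield `None`.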


Thanks for the answer! But your solution still downloads the whole >10 MB sequence.

Reply written 5.4 years ago by Xapple

It most certainly does not download the whole sequence:

    epost -db nuccore -id 332640072 | efetch -format docsum > file
    du -h file
    4.0K    file
Reply written 5.4 years ago by 5heikki
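
For a program that talks to the E-utilities over HTTP instead of through Entrez Direct, document summaries are also served by the `esummary.fcgi` endpoint. Below is a hedged Python sketch that only builds the request URL and performs no network access. The helper name `esummary_url` is made up for illustration, and `version=2.0` is used on the assumption that the newer summary format is the one carrying the SubName element.

```python
from urllib.parse import urlencode

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(ids, db="nuccore"):
    """Build an esummary request URL for a batch of GI identifiers."""
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),  # comma-separated batch of IDs
        "version": "2.0",                      # request the newer docsum format
    }
    return ESUMMARY + "?" + urlencode(params)


url = esummary_url(["332640072", "76365841"])
```

The response to such a request is a small XML summary per record, which is why the file in the reply above is only a few kilobytes.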

I wanted to get only the meta-information

Reply written 5.4 years ago by Xapple