Download only sequence information from NCBI
6.8 years ago
Xapple ▴ 30

My problem is the following: I have a list of GI identifiers from the NCBI nucleotide database. Take just this one, for instance: `76365841`. I want to extract the "isolation source" term from it. The answer here is "Everglades wetlands", which you can see by using "efetch":

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=76365841&rettype=gb

However, when I hit a full chromosome with a huge sequence, my program downloads the full sequence and Biopython's `Entrez.parse` is unable to handle it. For instance, with `332640072`:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=332640072&rettype=gb

Is there any way of building a request to NCBI to batch download the sequence information (including the isolation source) WITHOUT downloading the actual sequence, i.e. the ACGT bases?

If you want to see the program:

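    #python
    from Bio import Entrez
    # NCBI asks E-utilities clients to identify themselves; placeholder address.
    Entrez.email = "you@example.org"
    gis = ['332640072', '76365841', '22506766', '389043336']
    # Fetches the full GenBank XML, sequence included, for every GI.
    response = Entrez.efetch(db="nucleotide", id=gis, retmode="xml")
    records = list(Entrez.parse(response, validate=True))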
ncbi python xml efetch • 2.6k views
6.8 years ago

When you parse a large XML document, the idea is to skip some elements using either a StAX parser (http://en.wikipedia.org/wiki/StAX) or a SAX parser (http://en.wikipedia.org/wiki/Simple_API_for_XML). Both should be available in Python.

For example, the following Java program (`Biostar113766.java`) would skip the content of the DNA sequence. Compiled and run on the efetch XML output, it leaves every `<GBSeq_sequence>` element empty:

    javac Biostar113766.java
    curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=22506766,332640072,76365841,389043336&retmode=xml" |\
            java  Biostar113766 | grep '<GBSeq_sequence>'
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
      <GBSeq_sequence></GBSeq_sequence>
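
Since the question is in Python, the same idea can be sketched with the standard-library `xml.sax` module. This is only an illustration, not a port of the Java program: the handler class, the `isolation_sources` helper and the `SAMPLE` document (with made-up accessions `TEST0001` and `TEST0002`) are hypothetical names introduced here. It assumes the GenBank XML layout returned by efetch, where qualifiers appear as `GBQualifier_name` / `GBQualifier_value` pairs. Text is buffered only for the few elements of interest, so the characters inside `GBSeq_sequence` are never accumulated.

```python
import io
import xml.sax


class IsolationSourceHandler(xml.sax.ContentHandler):
    """Collect (accession, isolation_source) pairs, ignoring sequence text."""

    # Only these leaf elements have their text buffered; everything else,
    # including GBSeq_sequence, is discarded as it streams past.
    WANTED = {"GBSeq_primary-accession", "GBQualifier_name", "GBQualifier_value"}

    def __init__(self):
        super().__init__()
        self.results = []
        self._buf = None
        self._accession = None
        self._qual_name = None
        self._source = None

    def startElement(self, name, attrs):
        if name == "GBSeq":
            self._accession = None
            self._source = None
        elif name == "GBQualifier":
            self._qual_name = None
        self._buf = [] if name in self.WANTED else None

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name in self.WANTED:
            text = "".join(self._buf).strip()
            if name == "GBSeq_primary-accession":
                self._accession = text
            elif name == "GBQualifier_name":
                self._qual_name = text
            elif self._qual_name == "isolation_source":
                self._source = text
        elif name == "GBSeq":
            # One record finished; isolation_source stays None if absent.
            self.results.append((self._accession, self._source))
        self._buf = None


def isolation_sources(xml_stream):
    """Parse a binary file-like object holding GenBank XML."""
    handler = IsolationSourceHandler()
    xml.sax.parse(xml_stream, handler)
    return handler.results


# Hand-written sample for illustration only (accessions are made up).
SAMPLE = b"""<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>TEST0001</GBSeq_primary-accession>
    <GBSeq_feature-table><GBFeature><GBFeature_quals>
      <GBQualifier>
        <GBQualifier_name>isolation_source</GBQualifier_name>
        <GBQualifier_value>Everglades wetlands</GBQualifier_value>
      </GBQualifier>
    </GBFeature_quals></GBFeature></GBSeq_feature-table>
    <GBSeq_sequence>acgtacgtacgt</GBSeq_sequence>
  </GBSeq>
  <GBSeq>
    <GBSeq_primary-accession>TEST0002</GBSeq_primary-accession>
    <GBSeq_sequence>ttttgggg</GBSeq_sequence>
  </GBSeq>
</GBSet>"""

if __name__ == "__main__":
    for accession, source in isolation_sources(io.BytesIO(SAMPLE)):
        print(accession, source)
```

Note that, like the Java filter, this only keeps the sequence out of memory; the bases are still transferred over the network.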

 

 

6.8 years ago
5heikki 9.8k

Not all sequences have that information, but here is an example with Entrez Direct:

 

    epost -db nuccore -id 22506766,332640072,76365841,389043336 | efetch -format docsum | xtract -element SubName | tr "\t" "\n"
    3
    PB131|Panama: Panama Province, Las Cumbres Lake|lake water at 5 m depth during dry season|9.0986 N 79.5392 W
    F124|USA: Florida|Everglades wetlands
    SFD1-19|USA: San Francisco Delta, Mildred Island 2000-07-20

 

I'm not so sure the SubName element is standard though.
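
For a Python script, a rough equivalent might look like the sketch below. It rests on several assumptions that should be checked against real output: that the E-utilities `esummary.fcgi` endpoint with `version=2.0` returns `DocumentSummary` elements, that each carries pipe-delimited `SubType` and `SubName` fields in matching order, and that the record ID is available as a `uid` attribute or an `Id` child. The helper names and the `SAMPLE_DOCSUM` text are hypothetical; in particular, the `SubType` labels in the sample are a guess made for illustration, and only the `SubName` value comes from the output above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(ids, db="nuccore"):
    """Build a batch document-summary request (metadata only, no sequence)."""
    query = {"db": db, "id": ",".join(ids), "version": "2.0"}
    return ESUMMARY + "?" + urlencode(query, safe=",")


def parse_docsums(xml_text):
    """Map each record ID to a {SubType label: SubName value} dict."""
    out = {}
    for doc in ET.fromstring(xml_text).iter("DocumentSummary"):
        uid = doc.get("uid") or doc.findtext("Id")
        sub_types = doc.findtext("SubType") or ""
        sub_names = doc.findtext("SubName") or ""
        if sub_types and sub_names:
            # Assumed: both fields are "|"-separated and positionally aligned.
            out[uid] = dict(zip(sub_types.split("|"), sub_names.split("|")))
        else:
            out[uid] = {}
    return out


# Hand-written sample; the SubType labels are an illustrative assumption.
SAMPLE_DOCSUM = """<eSummaryResult><DocumentSummarySet status="OK">
  <DocumentSummary uid="76365841">
    <SubType>strain|country|isolation_source</SubType>
    <SubName>F124|USA: Florida|Everglades wetlands</SubName>
  </DocumentSummary>
</DocumentSummarySet></eSummaryResult>"""

if __name__ == "__main__":
    print(esummary_url(["22506766", "332640072", "76365841", "389043336"]))
    print(parse_docsums(SAMPLE_DOCSUM)["76365841"].get("isolation_source"))
```

Looking fields up by their `SubType` label, rather than by position in `SubName`, avoids depending on every record listing the same qualifiers.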


Thanks for the answer! But your solution still downloads the whole >10 MB sequence.


It most certainly does not download the whole sequence:

    epost -db nuccore -id 332640072 | efetch -format docsum > file
    du -h file
    4.0K    file

I wanted to get only the meta-information
