NCBI eutils: retrieve all isolation sources and pubmed IDs
8.7 years ago
Xapple ▴ 30

Hi!

For some research we are doing, we would like to build a small database (for example in SQLite3) holding a single table with three columns. The first column would contain the GI number of every sequence in the current NT database. These are easy to get with this command:

     blastdbcmd -db nt -entry all -outfmt '%g' > all_gis.txt

The second column would contain the "isolation_source" entry for that sequence, if its record has one; otherwise we can omit that row from the database.

The last column should contain the pubmed ID of the publication associated with that sequence, if it has one.
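To make the table layout concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table name `sequences`, the column names, and the helper functions are illustrative assumptions, not taken from the script mentioned in this post.

```python
import sqlite3


def create_database(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        " gi INTEGER PRIMARY KEY,"            # GI number of the sequence
        " isolation_source TEXT NOT NULL,"    # rows without one are skipped
        " pubmed_id INTEGER)"                 # NULL when no publication is linked
    )
    return conn


def add_rows(conn, rows):
    """Insert (gi, isolation_source, pubmed_id) tuples.

    Rows with no isolation source are dropped, as described above.
    Returns the number of rows actually written.
    """
    kept = [(gi, source, pubmed) for gi, source, pubmed in rows if source]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)", kept
        )
    return len(kept)
```

Passing a file path instead of `":memory:"` would keep the table on disk between runs.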

This isn't too hard to do, and I have written a script that does exactly that by querying NCBI through the eutils with Biopython.

The problem is that, as it is running right now, the estimated time to finish is 170+ hours. I would like to get some results faster. Do you know of any way to optimize this process? Maybe by changing the queries that are sent to NCBI? Currently the script queries NCBI for a particular GI number and receives the whole XML entry back. Is there a way to formulate a query that returns only the two fields that interest us, isolation_source and pubmed_id? It's quite frustrating to only be able to access this huge and very useful database over the web with custom, archaic utilities like "efetch".
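One possible way to cut down the number of round trips, if the script really sends one request per GI number, is to group the identifiers and ask for many records per request. The sketch below only shows the batching logic; the `fetch` argument is a hypothetical callable supplied by the caller, for example a thin wrapper around Biopython's `Entrez.efetch`, which accepts a comma-separated `id` string. The batch size of 200 is an assumption, not an NCBI recommendation, and each response would still contain whole records rather than just the two fields.

```python
from itertools import islice


def batched(ids, size):
    """Yield successive lists of at most `size` identifiers."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fetch_in_batches(ids, fetch, size=200):
    """Call `fetch` once per batch with a comma-separated ID string.

    `fetch` is a caller-supplied function (hypothetical here); one way
    to write it with Biopython would be:
        lambda id_string: Entrez.efetch(db="nucleotide", id=id_string,
                                        rettype="gb", retmode="xml")
    """
    for chunk in batched(ids, size):
        yield fetch(",".join(str(gi) for gi in chunk))
```

Because `batched` reads its input lazily, the GI list can be streamed straight from `all_gis.txt` without loading it all into memory.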

Thanks!

ncbi eutils python • 2.5k views

Why don't you (1) download all sequences from GenBank (ftp://ftp.ncbi.nlm.nih.gov/genbank/) in GenBank flat-file format and then (2) loop over each record using Python and extract the isolation source and PubMed ID?


Hey, you're right. That would actually solve this whole issue. So ALL the GenBank entries are stored in those archives? That's going to take up a lot of space on the hard drive, but it's probably doable. Thanks.
