Fetching Genbank Entries For List Of Accession Numbers.
8
12
Entering edit mode
10.0 years ago
Sanjar ▴ 140

I have a looong list of accession numbers, for which I need to fetch genbank entries. Usually, I used to use epost-efetch workflow for long lists. But now I can not, because epost doesn't accept accession numbers as IDs.

So the question is:

a) is there a way of batch downloading using accession numbers? if not b) is there a way of mapping Accession number to normal ids?

ncbi entrez • 35k views
0
Entering edit mode

can you give some example of your accession number? where is it from?

0
Entering edit mode

Ex: A22237,A22239,A32021,A32022,A33397 Those are accessions from NCBI. When you post them using epost, it gives this error: "IDs contain invalid characters which was treated as delimiters." So it appears to me that epost doesn't accept non-numeric characters for ID field. I tried to change the letters to their ascii codes, didn't help.

0
Entering edit mode

One more thing that I noticed today: All the BioXXX libraries just stop when they get error from epost. But I noticed that along with error, epost still returns WebEnv and query_key. But what it does is that, it takes the accession number, trims out the non-numeric characters, and searches for resultant GID. So, A22237 turns to 22237. Don't know what to do. Such a tiny problem taking up lot's of time.

8
Entering edit mode
6.9 years ago
Eli Korvigo ▴ 230

Python 3 + BioPython with command-line interface. Basically, a less hardcoded and flexible version of Leszek's code, that doesn't fail when the list of accessions is so big, that GI retrieval results in HTTP Error 414: Request-URI Too Large

#! /usr/bin/env python3

import argparse
import sys
import os

import Bio.Entrez

RETMAX = 10**9
GB_EXT = ".gb"

def parse_args(arg_lst):
parser = argparse.ArgumentParser()
help="NCBI database ID")
default="some_email@somedomain.com",
help="The number of accessions to process per request")

return parser.parse_args(arg_lst)

with open(fp) as acc_lines:
return [line.strip() for line in acc_lines]

def accessions_to_gb(accessions, db, batchsize, retmax):
def batch(sequence, size):
l = len(accessions)
for start in range(0, l, size):
yield sequence[start:min(start + size, l)]

def extract_records(records_handle):
buffer = []
for line in records_handle:
if line.startswith("LOCUS") and buffer:
# yield accession number and record
yield buffer[0].split()[1], "".join(buffer)
buffer = [line]
else:
buffer.append(line)
yield buffer[0].split()[1], "".join(buffer)

def process_batch(accessions_batch):
# get GI for query accessions
query = " ".join(accessions_batch)
query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)

# get GB files
search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
webenv=webenv, query_key=query_key)
yield from extract_records(records_handle)

accession_batches = batch(accessions, batchsize)
for acc_batch in accession_batches:
yield from process_batch(acc_batch)

def write_record(dir, accession, record):
with open(os.path.join(dir, accession + GB_EXT), "w") as output:
print(record, file=output)

def main(argv):
args = parse_args(argv)
op_dir = os.path.abspath(args.output_dir)
if not os.path.exists(op_dir):
os.makedirs(op_dir)
dbase = args.database
Bio.Entrez.email = args.email
batchsize = args.batch

for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
write_record(op_dir, acc, record)

if __name__ == "__main__":
main(sys.argv[1:])

0
Entering edit mode

If I want to download the full genbank file of my query id , I found this scripty can't work.and then ,I set the rettype to genbank , It doesn't work . So ,what should I do ? I want to download thousands of genome files

0
Entering edit mode

Do you mean that the file has no sequence in it? There are some entries in GenBank that do not provide sequences, only annotation and metadata. If it's not the case, then I'll gladly help you and update the code. Waiting for your feedback.

0
Entering edit mode

your code is pretty cool . I mean I want to download complete genome file . I just don't know how to select the rettype . I have solved the issue . If want to download complete genome file (genbank full), the rettype should be gbwithparts(rettype="gbwithparts") ,which means genbank file with protein sequences. Thank you so much ,and thanks for your nice code.

0
Entering edit mode

Hi, probably little off topic but I wasn't able to find anywhere how to pass seq_start and seq_stop optional arguments to list of queries. What I mean is, when, like in this script, you search batch of accs, obtain web-env key to easy e-fetch what you want in one step. Now e-fetch accepts seq_start and seq_stop controlling what part of sequence you get, but how to pass this information to e-fetch for multiple sequences is a mystery to me. (From what I've tried it returns only first entry correctly omitting others or it omits seq_start and seq_stop and return those entries.

Any suggestion would be appreciated. Thanks

0
Entering edit mode

Hi , Would it be possible to get the fasta sequence given a list of accessions instead of the genbank files ??

0
Entering edit mode

Yes, you need to change the rettype argument from gb to fasta

0
Entering edit mode

Hi, I tried this script but it shows syntax error as below:

File "accfetchfinal.py", line 66
yield from extract_records(records_handle)
^
SyntaxError: invalid syntax


Could you help in this pls..?

1
Entering edit mode

Hello, sorry for a late reply. Your Python seems to be too old. You should update to Python >= 3.3.2

7
Entering edit mode
6.8 years ago
ifreecell ▴ 220

I have successfully used wget to download more than 1000 plastid Genbank files using Accessions (with version, NC_018523.1). It's not fast, but works.

mkdir -p gb fa
cut -f 1 plastid.accs | while read -r acc
do
wget -O gb/"${acc}.gb" \ "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=gb&sendto=on&id=$acc"
#    sleep 1s
#    wget -O fa/"${acc}.fa" \ # "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&sendto=on&id=$acc"
done


To map accession to gi, you could try

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz # ~800MB
gunzip nucl_gb.accession2taxid.gz
grep -f plastid.accs nucl_gb.accession2taxid | cut -f  1, 4 > Acc_GI.csv

0
Entering edit mode

I have the same question . I try your method. It works.But If I want to download the full genbank file . what should I do? change the " dopt=gb"? what type should I use?

0
Entering edit mode

Replacing dopt=gb with dopt=gbwithparts should work, but I have not try yet.

0
Entering edit mode

It's not fast, but works.

5
Entering edit mode
10.0 years ago
Leszek 4.2k

You can try this:

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions.

USAGE:
python acc2gb.py A22237 A22239 A32021 A32022 A33397 > out.gb
or
cat ids | python acc2gb.py > out.gb

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez

db           = "nuccore"
Entrez.email = "some_email@somedomain.com"

if len(sys.argv[1:]) > 1:
accs = sys.argv[1:]
accs = [ l.strip() for l in sys.stdin if l.strip() ]
#fetch
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
for i,acc in enumerate(accs):
try:
sys.stderr.write( " %9i %s          \r" % (i+1,acc))
handle = Entrez.efetch(db=db, rettype="gb", id=acc)
#print output to stdout
except:
sys.stderr.write( "Error! Cannot fetch: %s        \n" % acc)


EDIT
The same using epost:

import sys
from Bio import Entrez

db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

if len(sys.argv[1:]) > 1:
accs = sys.argv[1:]
accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get GI for query accesions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"]
#fecth all results in batch of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
sys.stderr.write( " %9i" % (start+1,))
#fetch entries in batch
handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
#print output to stdout


0
Entering edit mode

I usually use this way. But it's extremely slow when you want to download thousands of entries and sometimes falls to timeout. Using epost would be much better. But, apparently there's no way of doing that.

0
Entering edit mode

you can do epost easily with biopython - you just have to be sure to provide valid accessions, otherwise the script will crash... have a look here how to do it: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114

0
Entering edit mode

I'm having the same issue with epost not taking accession numbers but integer UIDs instead...

0
Entering edit mode

Look at edited option: you can provide any id that is accepted by NCBI and the script will get UIDs of these automatically and print fasta...

0
Entering edit mode

This works fantastic! Just couldn't think myself of searching by accessions and getting ['IdList'] back. Thanks Leszek.

1
Entering edit mode

you can find more tricks in biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc110

0
Entering edit mode

@ Leszec I want to retrieve gene symbols using RefSeq protein(stored in text file) how can i do this using the program when i run the script saving it as .py its showing "SYNTAX ERROR" I use Python2.7 Help me to work on the script Thanks in advance

0
Entering edit mode

Thank you for this solution. But its now 8 years later, is there really no better way to do this?

2
Entering edit mode
10.0 years ago
Daniel ★ 3.9k

The hit-it-on-the-head-with-a-hammer alternative is to batch download a whole section of genbank, then filter out the ones you want using bioperl which will take accessions.

i.e. something along the lines of

    foreach my $acc (@acc){ if ($seq->accession eq $acc){$fileout->write_seq(\$seq);           ##etc. etc. etc.
}

0
Entering edit mode

0
Entering edit mode

There is the ftp if you want to batch DL ftp://ftp.ncbi.nlm.nih.gov/genbank/ But if you're only concerned about a certain taxa (prok, euk, human, whatever) you can just browser download it via http://www.ncbi.nlm.nih.gov/nuccore

0
Entering edit mode
10.0 years ago
secretjess ▴ 210

I'm not too sure which accession number you mean and what normal ID you are referring to but http://david.abcc.ncifcrf.gov/conversion.jsp has the ability to map between various accessions (e.g. genbank, uniprot) and IDs (e.g. entrez, ensembl)

0
Entering edit mode

I'm working with GenBank. So, I need to map Genbank accession numbers to Genbank GI number. DAVID can't find the majority of the accessions that I submit. Moreover, it doesn't provide API, afaik.

0
Entering edit mode

Ah, sorry that wasn't helpful. This looks like a similar question to yours (in reverse) so maybe the solution will be more useful: Map Genbank Gi To Accession Numbers (A Million Times)

1
Entering edit mode

I had a look on that thread. The problem is not solved there. If I had GIDs first, and wanted to convert them to Accessions, then that would be viable via epost-efetch:). But can't do the other way around. Thank you for your suggestions.

0
Entering edit mode
10.0 years ago
Wen.Huang ★ 1.2k

two not so specific cents:

1) NCBI has a well documented but under-used collection of E-utilities you may want to try

2) Bio-perl has modules to fetch GenBank entries using different IDs.

0
Entering edit mode

1) Yes, I like very much working with E-utils. It's the first time I got stuck with a problem. 2) True. There's no problem in fetching by accession number in one-request-at-a-time manner. But it's slow, due to 1/3 seconds limitation of NCBI. The problem comes when you want to try epost for batch downloading. Thank you for your suggestions!

0
Entering edit mode
10.0 years ago

Depends on how long "long" is, but up to some length, you can even use the Entrez web interface to get these e.g. querying with "A22237 A22239 A32021 A32022 A33397" here http://www.ncbi.nlm.nih.gov gives you a link to this page:

http://www.ncbi.nlm.nih.gov/nuccore/A22237A22239A32021A32022A33397

and you can get these in a range of formats e.g. full genbank, fasta, etc., using the "Send to" functionality

0
Entering edit mode
9.7 years ago
Sanjar ▴ 140

The script provided by Leszek solves the problem.