Question: Fetching GenBank Entries For A List Of Accession Numbers
Question (5 votes), written 4.5 years ago by Sanjarbek Hudaiberdiev (Trieste, Italy):

I have a very long list of accession numbers for which I need to fetch GenBank entries. I used to use the epost-efetch workflow for long lists, but now I can't, because epost doesn't accept accession numbers as IDs.

So the question is:

a) Is there a way of batch downloading using accession numbers? If not, b) is there a way of mapping accession numbers to normal (numeric) IDs?

ncbi entrez • 15k views
modified 15 months ago by ifreecell • written 4.5 years ago by Sanjarbek Hudaiberdiev

Can you give some examples of your accession numbers? Where are they from?

written 4.5 years ago by Leszek

Ex: A22237, A22239, A32021, A32022, A33397. Those are accessions from NCBI. When you post them using epost, it gives this error: "IDs contain invalid characters which was treated as delimiters." So it appears that epost doesn't accept non-numeric characters in the ID field. I tried changing the letters to their ASCII codes; it didn't help.

written 4.5 years ago by Sanjarbek Hudaiberdiev

One more thing I noticed today: all the BioXXX libraries just stop when they get an error from epost. But I noticed that along with the error, epost still returns WebEnv and query_key. What it actually does is take the accession number, trim out the non-numeric characters, and search for the resulting GI. So A22237 turns into 22237. I don't know what to do. Such a tiny problem is taking up a lot of time.

written 4.5 years ago by Sanjarbek Hudaiberdiev
Answer (6 votes), written 17 months ago by Eli Korvigo (Russian Federation):

Python 3 + Biopython with a command-line interface. Basically, a less hard-coded and more flexible version of Leszek's code that doesn't fail when the list of accessions is so big that GI retrieval results in HTTP Error 414 (Request-URI Too Large).

#! /usr/bin/env python3


import argparse
import sys
import os

import Bio.Entrez


RETMAX = 10**9
GB_EXT = ".gb"


def parse_args(arg_lst):
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    parser.add_argument("-d", "--database", type=str, required=True,
                        help="NCBI database ID")
    parser.add_argument("-e", "--email", type=str, required=False,
                        default="some_email@somedomain.com",
                        help="An e-mail address")
    parser.add_argument("-b", "--batch", type=int, required=False, default=100,
                        help="The number of accessions to process per request")
    parser.add_argument("-o", "--output_dir", type=str, required=True,
                        help="The directory to write downloaded files to")

    return parser.parse_args(arg_lst)


def read_accessions(fp):
    with open(fp) as acc_lines:
        return [line.strip() for line in acc_lines]


def accessions_to_gb(accessions, db, batchsize, retmax):
    def batch(sequence, size):
        # yield successive slices of `size` items from the sequence
        for start in range(0, len(sequence), size):
            yield sequence[start:start + size]

    def extract_records(records_handle):
        buffer = []
        for line in records_handle:
            if line.startswith("LOCUS") and buffer:
                # yield accession number and record
                yield buffer[0].split()[1], "".join(buffer)
                buffer = [line]
            else:
                buffer.append(line)
        yield buffer[0].split()[1], "".join(buffer)

    def process_batch(accessions_batch):
        # get GI for query accessions
        query = " ".join(accessions_batch)
        query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)
        gi_list = Bio.Entrez.read(query_handle)['IdList']

        # get GB files
        search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
        search_results = Bio.Entrez.read(search_handle)
        webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
        records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
                                           webenv=webenv, query_key=query_key)
        yield from extract_records(records_handle)

    accession_batches = batch(accessions, batchsize)
    for acc_batch in accession_batches:
        yield from process_batch(acc_batch)


def write_record(dir, accession, record):
    with open(os.path.join(dir, accession + GB_EXT), "w") as output:
        print(record, file=output)


def main(argv):
    args = parse_args(argv)
    accessions = read_accessions(os.path.abspath(args.input))
    op_dir = os.path.abspath(args.output_dir)
    if not os.path.exists(op_dir):
        os.makedirs(op_dir)
    dbase = args.database
    Bio.Entrez.email = args.email
    batchsize = args.batch

    for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
        write_record(op_dir, acc, record)


if __name__ == "__main__":
    main(sys.argv[1:])
modified 17 months ago • written 17 months ago by Eli Korvigo

If I want to download the full GenBank file for my query IDs, I found this script doesn't work. I then set the rettype to genbank, and it still doesn't work. So, what should I do? I want to download thousands of genome files.

written 12 months ago by mark900

Do you mean that the file has no sequence in it? There are some entries in GenBank that do not provide sequences, only annotation and metadata. If that's not the case, I'll gladly help you and update the code. Waiting for your feedback.

modified 12 months ago • written 12 months ago by Eli Korvigo

Your code is pretty cool. I meant that I want to download the complete genome file; I just didn't know how to select the rettype. I have solved the issue: to download the complete genome file (GenBank full), the rettype should be gbwithparts (rettype="gbwithparts"), which means the GenBank file with protein sequences. Thank you so much, and thanks for your nice code.

written 12 months ago by mark900
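The rettype switch described above can be sketched without Biopython, using only the standard library to build the efetch URL. This is an illustrative sketch, not code from the thread; the parameter names follow NCBI's E-utilities, and the actual download is left commented out:

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(acc, rettype="gbwithparts"):
    # rettype="gbwithparts" returns the full GenBank record, including
    # the feature table with translations; plain "gb" may omit these
    # for large genome records.
    params = {"db": "nuccore", "id": acc, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

url = efetch_url("NC_018523.1")  # example accession mentioned in this thread
# import urllib.request
# record = urllib.request.urlopen(url).read()  # would download the record
```

The same effect with Biopython would be passing rettype="gbwithparts" to Entrez.efetch.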

Hi, probably a little off topic, but I wasn't able to find anywhere how to pass the optional seq_start and seq_stop arguments for a list of queries. What I mean is: as in this script, you search a batch of accessions and obtain a WebEnv key so you can efetch what you want in one step. Now, efetch accepts seq_start and seq_stop to control what part of the sequence you get, but how to pass this information to efetch for multiple sequences is a mystery to me. (From what I've tried, it either returns only the first entry correctly and omits the others, or it ignores seq_start and seq_stop and returns the full entries.)

Any suggestion would be appreciated. Thanks

written 12 months ago by massa.kassa.sc3na
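One workaround, assuming the per-record ranges are known in advance: since efetch applies a single seq_start/seq_stop to the whole request, a WebEnv batch cannot carry per-record ranges, so each ranged record needs its own efetch call. A sketch that builds one request URL per (accession, start, stop) triple; the accessions and coordinates below are made up:

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def ranged_urls(ranges, db="nuccore"):
    # seq_start/seq_stop apply to the entire efetch request, so a record
    # with its own coordinate range needs its own request.
    for acc, start, stop in ranges:
        params = {"db": db, "id": acc, "rettype": "gb", "retmode": "text",
                  "seq_start": start, "seq_stop": stop}
        yield EFETCH + "?" + urlencode(params)

# Hypothetical input: each accession paired with its own range.
urls = list(ranged_urls([("A22237", 1, 200), ("A22239", 50, 400)]))
```

Each URL can then be fetched in a loop (respecting NCBI's request-rate limits), trading the one-step WebEnv fetch for per-record control.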

Hi, would it be possible to get the FASTA sequences given a list of accessions, instead of the GenBank files?

written 5 months ago by David

Yes, you need to change the rettype argument from gb to fasta.

written 4 months ago by Eli Korvigo
Answer (4 votes), written 4.5 years ago by Leszek (IIMCB, Poland):

You can try this:

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions. 

USAGE:
  python acc2gb.py A22237 A22239 A32021 A32022 A33397 > out.gb
or
  cat ids | python acc2gb.py > out.gb

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"

#load accessions from arguments
if sys.argv[1:]:
  accs = sys.argv[1:]
else: #load accessions from stdin
  accs = [l.strip() for l in sys.stdin if l.strip()]
#fetch
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
for i,acc in enumerate(accs):
  try:
    sys.stderr.write( " %9i %s          \r" % (i+1,acc))  
    handle = Entrez.efetch(db=db, rettype="gb", id=acc)
    #print output to stdout
    sys.stdout.write(handle.read())
  except Exception:
    sys.stderr.write("Error! Cannot fetch: %s        \n" % acc)

EDIT
The same using epost:

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if sys.argv[1:]:
  accs = sys.argv[1:]
else: #load accessions from stdin
  accs = [l.strip() for l in sys.stdin if l.strip()]
#first get GI for query accessions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#fetch all results in batches of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #print output to stdout
  sys.stdout.write(handle.read())

modified 4.4 years ago • written 4.5 years ago by Leszek

I usually use this approach. But it's extremely slow when you want to download thousands of entries, and it sometimes fails with a timeout. Using epost would be much better, but apparently there's no way of doing that.

written 4.4 years ago by Sanjarbek Hudaiberdiev

You can do epost easily with Biopython; you just have to be sure to provide valid accessions, otherwise the script will crash. Have a look here at how to do it: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114

written 4.4 years ago by Leszek

I'm having the same issue with epost not taking accession numbers but integer UIDs instead...

written 4.3 years ago by junyinglim

Look at the edited option: you can provide any ID that is accepted by NCBI, and the script will get the UIDs for these automatically and print the records.

written 4.3 years ago by Leszek

This works fantastically! I just couldn't think of searching by accessions and getting ['IdList'] back. Thanks, Leszek.

written 4.2 years ago by Sanjarbek Hudaiberdiev

You can find more tricks in the Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc110

written 4.2 years ago by Leszek

@Leszek I want to retrieve gene symbols using RefSeq protein IDs (stored in a text file). How can I do this using the program? When I run the script, saved as .py, it shows "SYNTAX ERROR". I use Python 2.7. Help me to get the script working. Thanks in advance.

written 4 months ago by lakshmi.bioinformatics
Answer (3 votes), written 15 months ago by ifreecell (Germany):

I have successfully used wget to download more than 1000 plastid GenBank files using accessions (with version, e.g. NC_018523.1). It's not fast, but it works.

mkdir -p gb fa
cut -f 1 plastid.accs | while read -r acc
do
    wget -O gb/"${acc}.gb" \
       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=gb&sendto=on&id=$acc"
#    sleep 1s
#    wget -O fa/"${acc}.fa" \
#       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&sendto=on&id=$acc"
done

To map accessions to GIs, you could try:

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz # ~800MB
gunzip nucl_gb.accession2taxid.gz
grep -f plastid.accs nucl_gb.accession2taxid | cut -f 1,4 > Acc_GI.csv
written 15 months ago by ifreecell
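The grep/cut step above can also be done in Python. The sketch below parses the four tab-separated columns of nucl_gb.accession2taxid (accession, accession.version, taxid, gi; the first line is a header) and keeps only the wanted accessions. The sample lines, taxids, and GI values here are invented for illustration:

```python
def acc_to_gi(lines, wanted):
    # nucl_gb.accession2taxid is tab-separated:
    # accession <TAB> accession.version <TAB> taxid <TAB> gi
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4 and fields[0] in wanted:
            mapping[fields[0]] = fields[3]  # accession -> gi
    return mapping

sample = (
    "accession\taccession.version\ttaxid\tgi\n"
    "A22237\tA22237.1\t2759\t1234567\n"
    "A22239\tA22239.1\t2759\t1234568\n"
)
gi_map = acc_to_gi(sample.splitlines(), {"A22237", "A22239"})
```

For the real 800 MB file you would stream it line by line (e.g. `with open(path) as fh: acc_to_gi(fh, wanted)`) rather than loading it whole.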

I have the same question. I tried your method and it works. But if I want to download the full GenBank file, what should I do? Change "dopt=gb"? What type should I use?

written 12 months ago by mark900

Replacing dopt=gb with dopt=gbwithparts should work, but I have not tried it yet.

written 11 months ago by ifreecell
Answer (2 votes), written 4.5 years ago by Daniel (Cardiff University):

The hit-it-on-the-head-with-a-hammer alternative is to batch download a whole section of GenBank, then filter out the ones you want using BioPerl, which does accept accessions.

i.e. something along the lines of

    foreach my $acc (@acc){
        if ($seq->accession eq $acc){
            $fileout->write_seq($seq);   ## etc. etc. etc.
        }
    }
written 4.5 years ago by Daniel
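The same filter-after-bulk-download idea can be sketched in plain Python, without BioPerl: scan a concatenated GenBank flat file, where each record starts at a LOCUS line and ends at a "//" line, and keep only the wanted accessions. The sample records below are minimal stand-ins, not real entries:

```python
def filter_genbank(lines, wanted):
    # Keep records whose LOCUS name is in `wanted`; each record runs
    # from its "LOCUS" line to the terminating "//" line.
    buffer, keep = [], False
    for line in lines:
        if line.startswith("LOCUS"):
            buffer, keep = [], line.split()[1] in wanted
        buffer.append(line)
        if line.startswith("//") and keep:
            yield "".join(buffer)

sample = (
    "LOCUS       A22237   10 bp\nORIGIN\n//\n"
    "LOCUS       A99999   10 bp\nORIGIN\n//\n"
)
records = list(filter_genbank(sample.splitlines(keepends=True), {"A22237"}))
```

Because it streams line by line, this works on multi-gigabyte flat files without loading them into memory.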

Any suggestions on how to download the whole of GenBank?

written 4.4 years ago by Sanjarbek Hudaiberdiev

There is the FTP site if you want to batch-download: ftp://ftp.ncbi.nlm.nih.gov/genbank/ But if you're only concerned with certain taxa (prokaryotes, eukaryotes, human, whatever), you can just download them via the browser at http://www.ncbi.nlm.nih.gov/nuccore

written 4.4 years ago by Daniel
Answer (0 votes), written 4.5 years ago by secretjess (Cambridge):

I'm not too sure which accession number you mean and what normal ID you are referring to, but http://david.abcc.ncifcrf.gov/conversion.jsp can map between various accessions (e.g. GenBank, UniProt) and IDs (e.g. Entrez, Ensembl).

written 4.5 years ago by secretjess

I'm working with GenBank, so I need to map GenBank accession numbers to GenBank GI numbers. DAVID can't find the majority of the accessions I submit. Moreover, it doesn't provide an API, AFAIK.

written 4.5 years ago by Sanjarbek Hudaiberdiev

Ah, sorry that wasn't helpful. This looks like a similar question to yours (in reverse), so maybe its solution will be more useful: Map Genbank Gi To Accession Numbers (A Million Times)

written 4.5 years ago by secretjess

I had a look at that thread. The problem is not solved there. If I had GIs first and wanted to convert them to accessions, that would be viable via epost-efetch :). But I can't do it the other way around. Thank you for your suggestions.

written 4.5 years ago by Sanjarbek Hudaiberdiev
Answer (0 votes), written 4.5 years ago by Wen.Huang:

Two not-so-specific cents:

1) NCBI has a well-documented but under-used collection of E-utilities you may want to try.

2) BioPerl has modules to fetch GenBank entries using different IDs.

written 4.5 years ago by Wen.Huang

1) Yes, I very much like working with E-utils; this is the first time I've gotten stuck on a problem with them. 2) True. There's no problem fetching by accession number in a one-request-at-a-time manner, but it's slow, due to NCBI's limit of one request per 1/3 second. The problem comes when you want to use epost for batch downloading. Thank you for your suggestions!

written 4.5 years ago by Sanjarbek Hudaiberdiev
Answer (0 votes), written 4.5 years ago by aidan-budd (Germany):

It depends on how long "long" is, but up to some length you can even use the Entrez web interface to get these, e.g. querying with "A22237 A22239 A32021 A32022 A33397" at http://www.ncbi.nlm.nih.gov gives you a link to this page:

http://www.ncbi.nlm.nih.gov/nuccore/A22237,A22239,A32021,A32022,A33397

and you can get these in a range of formats, e.g. full GenBank, FASTA, etc., using the "Send to" functionality.

written 4.5 years ago by aidan-budd
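For scripting, the same multi-accession page URL can be built by joining the accessions; the comma-separated form below is an assumption about how Entrez joins multiple IDs in a nuccore URL:

```python
# Build the nuccore page URL for several accessions at once.
accs = ["A22237", "A22239", "A32021", "A32022", "A33397"]
url = "http://www.ncbi.nlm.nih.gov/nuccore/" + ",".join(accs)
```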
Answer (0 votes), written 4.2 years ago by Sanjarbek Hudaiberdiev (Trieste, Italy):

The script provided by Leszek solves the problem.

written 4.2 years ago by Sanjarbek Hudaiberdiev
Powered by Biostar version 2.3.0