Question: Get just GenBank record while downloading genome with Biopython
Shred150 wrote:

I wrote a script to download a genome in GenBank (gbk) format from NCBI by querying with specific keywords. What I want is the full annotated genome. I'm currently querying the "nucleotide" database, and in my specific case I get two results: the RefSeq record and the GenBank one. I expected just one record, because there is only one reference genome for the organism queried. From what I've read on the NCBI website, the RefSeq record in this case just refers to the GenBank one (source) and contains no sequence. So here's the question: is there a way to download only the GenBank record that contains the sequence, discarding the useless records? Here's my code:

from Bio import SeqIO
from Bio import Entrez

Entrez.email = "mail@gmail.com"
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title]"
# Search the nucleotide database and collect the matching record IDs
handle = Entrez.esearch(db="nucleotide", term=search_term)
genome_ids = Entrez.read(handle)['IdList']
handle.close()

for genome_id in genome_ids:
    # Fetch each hit as a GenBank flat file and write it to disk
    handle = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = 'GenBank_Record_{}.gbk'.format(genome_id)
    print('Writing: {}'.format(filename))
    with open(filename, "w") as f:
        f.write(handle.read())
    handle.close()
print(genome_ids)
Tags: entrez, biopython, genome
Written 7 weeks ago by Shred150; modified 7 weeks ago by vkkodali960
vkkodali960 wrote:

Change your search_term to include a GenBank or RefSeq filter, as shown below for GenBank and RefSeq sequences, respectively:

## GenBank sequences only
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title] AND genbank[filter]"
## RefSeq sequences only
search_term = "Bifidobacterium+bifidum+PRL2010[organism] AND complete+genome[title] AND refseq[filter]"

If you are fetching a large number of sequences, you may also want to read about E-utilities API keys, which help avoid HTTP 429 (too many requests) errors.
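To make the API key point a bit more concrete, here is a minimal sketch of how a key ends up in an E-utilities request. It builds the efetch URL by hand with the standard library only, so nothing is sent over the network. The helper names `build_efetch_url` and `min_delay_seconds` and the key value are hypothetical and used just for illustration. The delays reflect NCBI's documented limits of 3 requests per second without a key and 10 with one. With Biopython itself, setting `Entrez.api_key` should have the same effect on the requests it sends.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch endpoint
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(record_id, email, api_key=None, rettype="gb"):
    """Build an efetch URL for one nucleotide record (hypothetical helper)."""
    params = {
        "db": "nucleotide",
        "id": record_id,
        "rettype": rettype,
        "retmode": "text",
        "email": email,
    }
    # The key is simply an extra query parameter; omit it if you have none
    if api_key:
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urlencode(params)


def min_delay_seconds(api_key=None):
    """Smallest pause between requests that stays within NCBI's limits."""
    # 10 requests/second with a key, 3 requests/second without
    return 0.1 if api_key else 1 / 3


if __name__ == "__main__":
    # "HYPOTHETICAL_KEY" and the record ID are placeholders, not real values
    print(build_efetch_url("12345", "mail@gmail.com", api_key="HYPOTHETICAL_KEY"))
    print(min_delay_seconds("HYPOTHETICAL_KEY"))
```

In a download loop you would sleep for at least `min_delay_seconds(...)` between requests, for example with `time.sleep`.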

Written 7 weeks ago by vkkodali960

Thanks for the API recommendation. Adding the GenBank filter works, but in terms of annotation this could be a problem, because reference genomes are by default more accurate than standard GenBank submissions. I'm implementing a for loop that iterates over the downloaded records to discard sequence-free files. It's crazy how confusing submissions are in bioinformatics.

Written 7 weeks ago by Shred150

I'm implementing a for loop that iterates over the downloaded records to discard sequence-free files.

Change your rettype to gbwithparts and all RefSeq flatfiles will be downloaded with contig sequences.
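If you want to verify that a downloaded file really carries sequence, a rough check is to look for an ORIGIN block in the flat file: contig-based records fetched with rettype="gb" typically have a CONTIG line and no ORIGIN block, while gbwithparts output includes the sequence lines. The sketch below is only an illustration. The function name `has_sequence` and the sample records are made up, and it inspects the raw text rather than parsing the record properly.

```python
def has_sequence(genbank_text):
    """Return True if any record in the flat-file text has sequence lines.

    Hypothetical helper: it looks for an ORIGIN line followed by at least
    one line of the form '<position> <bases> ...' before the '//' terminator.
    """
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if in_origin:
            if line.startswith("//"):
                # End of this record; keep scanning in case more follow
                in_origin = False
                continue
            parts = line.split()
            if len(parts) >= 2 and parts[0].isdigit():
                return True
    return False


if __name__ == "__main__":
    # Made-up, heavily truncated records used only to exercise the check
    with_sequence = "LOCUS       EXAMPLE1\nORIGIN\n        1 gatcctccat atacaacggt\n//\n"
    contig_only = "LOCUS       EXAMPLE2\nCONTIG      join(AB000001.1:1..100)\n//\n"
    print(has_sequence(with_sequence))
    print(has_sequence(contig_only))
```

You could call this on `handle.read()` before writing the file and skip records for which it returns False.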

Written 7 weeks ago by vkkodali960

Fine, that's what I've been looking for.

Written 6 weeks ago by Shred150