Question: Fetching Genbank Assembly IDs with protein
gravatar for anabaena
10 months ago by
anabaena0 wrote:

Hey all, I am interested in getting records particular CDS matches. I have particular BLAST CDS matches that I need to retrieve the Genbank genome assembly ascension number. I have several hundred so I wanted to automate this with Entrez in biopython if possible. My blast results give me a protein ID number, that when searched in NCBI, come up with the metagenome that it is associated with, I was just looking for guidance using biopython to retrieve the Genome assembly GenBank ascension ID. My current script currently isn't even finding the record? which is odd because it shows up when searching NCBI's protein database on the website. I did a search with the below script and had 0 hits.

 from Bio import Entrez = "

handle = Entrez.egquery(term='MAG94467.1')
record =
for row in record["eGQueryResult"]:
    print(row['DbName'], row["Count"])
entrez biopython python • 625 views
ADD COMMENTlink modified 9 months ago by vkkodali2.2k • written 10 months ago by anabaena0
gravatar for vkkodali
10 months ago by
United States
vkkodali2.2k wrote:

Since you already have accessions, try the following:

from Bio import Entrez,SeqIO = 'xxx@yyy.zzz'

handle = Entrez.efetch(id = 'MAG94467.1', db = 'protein', rettype='gb', retmode='text')
record =, 'genbank')
ADD COMMENTlink written 10 months ago by vkkodali2.2k

Alright so I am getting the WGS/contig ascencion but not the assembly ID (i.e. starts with GCA_XXXX.X). Any idea on how to retrieve that?

ADD REPLYlink written 10 months ago by anabaena0

What is it that you are looking for in the end? You can search the ‘assembly’ database and fetch summary about the assembly but that won't get you the sequence. If you want A specific sequence then you need to figure out which sequence it is you want from the assembly because an assembly can have more than one nucleotide sequence and each of those in turn can have multiple annotated proteins. Perhaps I can help you if you can describe what it is that you are trying to achieve in a little more detail.

ADD REPLYlink written 10 months ago by vkkodali2.2k

So what the end game is, is to append a pandas dataframe with the Genbank Assembly ID (i.e. GCA_XXXX.X). I ran a blast of a concatenated .faa file from thousands of assembled genomes(obtained from a BioProject) through making a local database, and do not have the assembly ID's associated the protein accension numbers anymore. So what I need to do is find the complete Genbank assembly ID 's to add to the data frame and continue form there. The problem is when I search the protein acc ID (i.e. MAG94467.1) I get a contig/WGS acc ID but not the genome assembly ID, although when I search manually on NCBI I see it come up. The problem is I need to automate this as I have alot of different BLAST searches and a lot of different proteins so I am trying to do it through Entrez

ADD REPLYlink modified 10 months ago • written 10 months ago by anabaena0
gravatar for vkkodali
9 months ago by
United States
vkkodali2.2k wrote:

Thanks, I understand this better now. For this purpose, here's what I would do:

  1. Download the entire list of GCA accessions
  2. For each GCA accession, download the feature_table.txt file from NCBI FTP. For example, the feature_table.txt file for the assembly GCA_002691735.1 is here. You should be able to write a bash loop to fetch all of the feature_table.txt files. Either that, or you can go to the BioProject page, click on the 'Assembly' link on the right hand side to list all associated assemblies. Then, click on the blue 'Download Assemblies' button and pick 'Feature table' as your file of choice.
  3. Once you have all of the feature table files, concatenate them, and extract the lines with CDS in column 1, and a protein accession in column 11; column 3 has the GCA accession.
ADD COMMENTlink written 9 months ago by vkkodali2.2k
gravatar for anabaena
10 months ago by
anabaena0 wrote:

if anyone is interested here is what I was trying to do:

def find_contig_id(protein):
    from Bio import Entrez, SeqIO = ''
    eser_handle = Entrez.efetch(id=protein, db ='protein', rettype='gb', retmode='text')
    eser_record =, 'gb')
    interim = eser_record.annotations['db_source']
    research = interim.lstrip('accession ')
    r_handle = Entrez.efetch(id=research, db='nucleotide', rettype='gb', retmode='text')
    r_record =, 'gb')
    result = r_record.annotations['accessions'][1]
    s_handle = Entrez.esummary(db='nucleotide', id=result, rettype='gb', retmode='text')
    s_record =
    return s_record[0]['AccessionVersion']

def Retrieve_Assembly_Accession(contig):
    from Bio import Entrez = ''
    esearch_file = Entrez.esearch(db='assembly', term=contig)
    esearch_record =
    for id in esearch_record['IdList']:
        es_handle = Entrez.esummary(db='assembly', id=id, report='full')
        es_record =
        result_final = es_record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession']
        return result_final

contig_id = []
accession_id = []

for x in aioA_database['Subject ID']:
    query = find_contig_id(x)
    accession = Retrieve_Assembly_Accession(query)
ADD COMMENTlink written 10 months ago by anabaena0

Is this final answer to your original question or just additional information? If latter this should either be added to the original post or moved to a comment on top post.

ADD REPLYlink written 10 months ago by genomax92k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1082 users visited in the last hour