Question

Fetching Genbank Assembly IDs with protein

0

Entering edit mode

4.2 years ago

anabaena ▴ 10

Hey all, I am interested in getting records particular CDS matches. I have particular BLAST CDS matches that I need to retrieve the Genbank genome assembly ascension number. I have several hundred so I wanted to automate this with Entrez in biopython if possible. My blast results give me a protein ID number, that when searched in NCBI, come up with the metagenome that it is associated with, I was just looking for guidance using biopython to retrieve the Genome assembly GenBank ascension ID. My current script currently isn't even finding the record? which is odd because it shows up when searching NCBI's protein database on the website. I did a search with the below script and had 0 hits.

 from Bio import Entrez

Entrez.email = "xxxxx@xxxx.com

handle = Entrez.egquery(term='MAG94467.1')
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    print(row['DbName'], row["Count"])

Biopython Python Entrez • 2.7k views

ADD COMMENT • link updated 4.2 years ago by vkkodali_ncbi ★ 3.7k • written 4.2 years ago by anabaena ▴ 10

1

Entering edit mode

4.2 years ago

anabaena ▴ 10

if anyone is interested here is what I was trying to do:

def find_contig_id(protein):
    from Bio import Entrez, SeqIO
    Entrez.email = 'zack.henning@berkeley.edu'
    print(protein)
    eser_handle = Entrez.efetch(id=protein, db ='protein', rettype='gb', retmode='text')
    eser_record = SeqIO.read(eser_handle, 'gb')
    interim = eser_record.annotations['db_source']
    research = interim.lstrip('accession ')
    r_handle = Entrez.efetch(id=research, db='nucleotide', rettype='gb', retmode='text')
    r_record = SeqIO.read(r_handle, 'gb')
    result = r_record.annotations['accessions'][1]
    s_handle = Entrez.esummary(db='nucleotide', id=result, rettype='gb', retmode='text')
    s_record = Entrez.read(s_handle)
    print(s_record[0]['AccessionVersion'])
    return s_record[0]['AccessionVersion']

def Retrieve_Assembly_Accession(contig):
    from Bio import Entrez
    Entrez.email = 'zack.henning@berkeley.edu'
    esearch_file = Entrez.esearch(db='assembly', term=contig)
    esearch_record = Entrez.read(esearch_file)
    for id in esearch_record['IdList']:
        es_handle = Entrez.esummary(db='assembly', id=id, report='full')
        es_record = Entrez.read(es_handle)
        result_final = es_record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession']
        print(result_final)
        return result_final


contig_id = []
accession_id = []


for x in aioA_database['Subject ID']:
    query = find_contig_id(x)
    contig_id.append(query)
    accession = Retrieve_Assembly_Accession(query)
    accession_id.append(accession)

ADD COMMENT • link 4.2 years ago by anabaena ▴ 10

0

Entering edit mode

Is this final answer to your original question or just additional information? If latter this should either be added to the original post or moved to a comment on top post.

ADD REPLY • link 4.2 years ago by GenoMax 141k

1

Entering edit mode

4.2 years ago

vkkodali_ncbi ★ 3.7k

Thanks, I understand this better now. For this purpose, here's what I would do:

Download the entire list of GCA accessions
For each GCA accession, download the feature_table.txt file from NCBI FTP. For example, the feature_table.txt file for the assembly GCA_002691735.1 is here. You should be able to write a bash loop to fetch all of the feature_table.txt files. Either that, or you can go to the BioProject page, click on the 'Assembly' link on the right hand side to list all associated assemblies. Then, click on the blue 'Download Assemblies' button and pick 'Feature table' as your file of choice.
Once you have all of the feature table files, concatenate them, and extract the lines with CDS in column 1, and a protein accession in column 11; column 3 has the GCA accession.

ADD COMMENT • link 4.2 years ago by vkkodali_ncbi ★ 3.7k

score 2 · Accepted Answer · 2020-01-24

2

Entering edit mode

4.2 years ago

vkkodali_ncbi ★ 3.7k

Since you already have accessions, try the following:

from Bio import Entrez,SeqIO

Entrez.email = 'xxx@yyy.zzz'

handle = Entrez.efetch(id = 'MAG94467.1', db = 'protein', rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')
print(record)

ADD COMMENT • link 4.2 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

Alright so I am getting the WGS/contig ascencion but not the assembly ID (i.e. starts with GCA_XXXX.X). Any idea on how to retrieve that?

ADD REPLY • link 4.2 years ago by anabaena ▴ 10

0

Entering edit mode

What is it that you are looking for in the end? You can search the ‘assembly’ database and fetch summary about the assembly but that won't get you the sequence. If you want A specific sequence then you need to figure out which sequence it is you want from the assembly because an assembly can have more than one nucleotide sequence and each of those in turn can have multiple annotated proteins. Perhaps I can help you if you can describe what it is that you are trying to achieve in a little more detail.

ADD REPLY • link 4.2 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

So what the end game is, is to append a pandas dataframe with the Genbank Assembly ID (i.e. GCA_XXXX.X). I ran a blast of a concatenated .faa file from thousands of assembled genomes(obtained from a BioProject) through making a local database, and do not have the assembly ID's associated the protein accension numbers anymore. So what I need to do is find the complete Genbank assembly ID 's to add to the data frame and continue form there. The problem is when I search the protein acc ID (i.e. MAG94467.1) I get a contig/WGS acc ID but not the genome assembly ID, although when I search manually on NCBI I see it come up. The problem is I need to automate this as I have alot of different BLAST searches and a lot of different proteins so I am trying to do it through Entrez

ADD REPLY • link 4.2 years ago by anabaena ▴ 10