Question: Fetching Genbank Assembly IDs with protein
0
gravatar for anabaena
25 days ago by
anabaena0
anabaena0 wrote:

Hey all, I am interested in getting records particular CDS matches. I have particular BLAST CDS matches that I need to retrieve the Genbank genome assembly ascension number. I have several hundred so I wanted to automate this with Entrez in biopython if possible. My blast results give me a protein ID number, that when searched in NCBI, come up with the metagenome that it is associated with, I was just looking for guidance using biopython to retrieve the Genome assembly GenBank ascension ID. My current script currently isn't even finding the record? which is odd because it shows up when searching NCBI's protein database on the website. I did a search with the below script and had 0 hits.

 from Bio import Entrez

Entrez.email = "xxxxx@xxxx.com

handle = Entrez.egquery(term='MAG94467.1')
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    print(row['DbName'], row["Count"])
entrez biopython python • 221 views
ADD COMMENTlink modified 15 days ago by vkkodali1.7k • written 25 days ago by anabaena0
2
gravatar for vkkodali
24 days ago by
vkkodali1.7k
United States
vkkodali1.7k wrote:

Since you already have accessions, try the following:

from Bio import Entrez,SeqIO

Entrez.email = 'xxx@yyy.zzz'

handle = Entrez.efetch(id = 'MAG94467.1', db = 'protein', rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')
print(record)
ADD COMMENTlink written 24 days ago by vkkodali1.7k

Alright so I am getting the WGS/contig ascencion but not the assembly ID (i.e. starts with GCA_XXXX.X). Any idea on how to retrieve that?

ADD REPLYlink written 24 days ago by anabaena0

What is it that you are looking for in the end? You can search the ‘assembly’ database and fetch summary about the assembly but that won't get you the sequence. If you want A specific sequence then you need to figure out which sequence it is you want from the assembly because an assembly can have more than one nucleotide sequence and each of those in turn can have multiple annotated proteins. Perhaps I can help you if you can describe what it is that you are trying to achieve in a little more detail.

ADD REPLYlink written 24 days ago by vkkodali1.7k

So what the end game is, is to append a pandas dataframe with the Genbank Assembly ID (i.e. GCA_XXXX.X). I ran a blast of a concatenated .faa file from thousands of assembled genomes(obtained from a BioProject) through making a local database, and do not have the assembly ID's associated the protein accension numbers anymore. So what I need to do is find the complete Genbank assembly ID 's to add to the data frame and continue form there. The problem is when I search the protein acc ID (i.e. MAG94467.1) I get a contig/WGS acc ID but not the genome assembly ID, although when I search manually on NCBI I see it come up. The problem is I need to automate this as I have alot of different BLAST searches and a lot of different proteins so I am trying to do it through Entrez

ADD REPLYlink modified 21 days ago • written 24 days ago by anabaena0
1
gravatar for vkkodali
15 days ago by
vkkodali1.7k
United States
vkkodali1.7k wrote:

Thanks, I understand this better now. For this purpose, here's what I would do:

  1. Download the entire list of GCA accessions
  2. For each GCA accession, download the feature_table.txt file from NCBI FTP. For example, the feature_table.txt file for the assembly GCA_002691735.1 is here. You should be able to write a bash loop to fetch all of the feature_table.txt files. Either that, or you can go to the BioProject page, click on the 'Assembly' link on the right hand side to list all associated assemblies. Then, click on the blue 'Download Assemblies' button and pick 'Feature table' as your file of choice.
  3. Once you have all of the feature table files, concatenate them, and extract the lines with CDS in column 1, and a protein accession in column 11; column 3 has the GCA accession.
ADD COMMENTlink written 15 days ago by vkkodali1.7k
0
gravatar for anabaena
18 days ago by
anabaena0
anabaena0 wrote:

if anyone is interested here is what I was trying to do:

def find_contig_id(protein):
    from Bio import Entrez, SeqIO
    Entrez.email = 'zack.henning@berkeley.edu'
    print(protein)
    eser_handle = Entrez.efetch(id=protein, db ='protein', rettype='gb', retmode='text')
    eser_record = SeqIO.read(eser_handle, 'gb')
    interim = eser_record.annotations['db_source']
    research = interim.lstrip('accession ')
    r_handle = Entrez.efetch(id=research, db='nucleotide', rettype='gb', retmode='text')
    r_record = SeqIO.read(r_handle, 'gb')
    result = r_record.annotations['accessions'][1]
    s_handle = Entrez.esummary(db='nucleotide', id=result, rettype='gb', retmode='text')
    s_record = Entrez.read(s_handle)
    print(s_record[0]['AccessionVersion'])
    return s_record[0]['AccessionVersion']

def Retrieve_Assembly_Accession(contig):
    from Bio import Entrez
    Entrez.email = 'zack.henning@berkeley.edu'
    esearch_file = Entrez.esearch(db='assembly', term=contig)
    esearch_record = Entrez.read(esearch_file)
    for id in esearch_record['IdList']:
        es_handle = Entrez.esummary(db='assembly', id=id, report='full')
        es_record = Entrez.read(es_handle)
        result_final = es_record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession']
        print(result_final)
        return result_final


contig_id = []
accession_id = []


for x in aioA_database['Subject ID']:
    query = find_contig_id(x)
    contig_id.append(query)
    accession = Retrieve_Assembly_Accession(query)
    accession_id.append(accession)
ADD COMMENTlink written 18 days ago by anabaena0

Is this final answer to your original question or just additional information? If latter this should either be added to the original post or moved to a comment on top post.

ADD REPLYlink written 18 days ago by genomax78k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1174 users visited in the last hour