Question: NCBI Entrez protein by taxonomy ID
25 days ago by flogin, Brazil
flogin wrote:

Hey,

I'm learning Bio.Entrez to retrieve information from NCBI.

I have already written basic scripts that retrieve sequences by protein or nucleotide ID, but I'm wondering whether I can retrieve all proteins for a specific taxonomy ID.

So I have a 3-column CSV file, like this:

Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus

And so far I have written this:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import argparse
import csv
import xml.etree.ElementTree as ET

from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "your.name@example.com"

parser = argparse.ArgumentParser(description='This script reads a CSV file and returns protein information by viral family.')
parser.add_argument("-in", "--input", help="CSV file with 3 columns", required=True)
args = parser.parse_args()
input_file = args.input

# The third column holds the species name used as the search term.
with open(input_file, 'r', newline='') as in_file:
    reader_in_file = csv.reader(in_file, delimiter=',')
    viral_family_lst = []
    for line in reader_in_file:
        viral_family = line[2].strip()
        viral_family_lst.append(viral_family)

for viral_family in viral_family_lst:
    handle_id_var = Entrez.esearch(db="Taxonomy", term=viral_family, retmode='xml')
    tree = ET.parse(handle_id_var)
    handle_id_var.close()
    root = tree.getroot()
    # Print every taxonomy ID found in the esearch result.
    for id_list in root.findall('IdList'):
        for id_elem in id_list.findall('Id'):
            tax_id = id_elem.text
            print(tax_id)

So, at the moment, this script returns the taxonomy ID for each viral species, and I don't know how to use these IDs to retrieve all proteins for each virus.
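
One way to go from a taxonomy ID to proteins is to search the Protein database with a `txid<ID>[Organism]` term. The following is only a minimal sketch, assuming Biopython is installed: `protein_term` and `search_protein_ids` are hypothetical helper names, the email address is a placeholder, and the `retmax` default of 100 is an arbitrary choice.

```python
def protein_term(taxid, include_descendants=True):
    """Build an Entrez query term that selects proteins by taxonomy ID."""
    # [Organism:exp] also matches taxa below this node (strains, serotypes);
    # [Organism:noexp] matches only records assigned to exactly this taxon.
    field = "Organism:exp" if include_descendants else "Organism:noexp"
    return f"txid{taxid}[{field}]"


def search_protein_ids(taxid, retmax=100, email="your.name@example.com"):
    """Return up to `retmax` protein UIDs for one taxonomy ID (needs network)."""
    # Imported here so protein_term() stays usable without Biopython.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    handle = Entrez.esearch(db="protein", term=protein_term(taxid), retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])


if __name__ == "__main__":
    print(protein_term(333387))
    print(protein_term(333387, include_descendants=False))
```

Whether to expand to descendant taxa depends on the data: for a virus species with many serotypes or isolates, the `exp` form will usually return more records than `noexp`.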


Using Entrez Direct (EDirect). Translate into Python as needed:

$ esearch -db taxonomy -query "333387 [taxid]" | elink -target protein | efetch -format fasta | grep "^>"
>AAY88865.2 orf1ab polyprotein [Bat SARS coronavirus HKU3-1]
>AAY88875.1 hypothetical protein orf9b [Bat SARS coronavirus HKU3-1]
>AAY88874.1 nucleocapsid phosphoprotein [Bat SARS coronavirus HKU3-1]
>AAY88873.1 hypothetical protein orf8 [Bat SARS coronavirus HKU3-1]
>AAY88872.1 hypothetical protein orf7b [Bat SARS coronavirus HKU3-1]
>AAY88871.1 hypothetical protein orf7a [Bat SARS coronavirus HKU3-1]
>AAY88870.1 hypothetical protein orf6 [Bat SARS coronavirus HKU3-1]
>AAY88869.1 membrane glycoprotein [Bat SARS coronavirus HKU3-1]
>AAY88868.1 small membrane protein [Bat SARS coronavirus HKU3-1]
>AAY88867.1 hypothetical protein orf3a [Bat SARS coronavirus HKU3-1]
>AAY88866.1 spike glycoprotein [Bat SARS coronavirus HKU3-1]
written 24 days ago by genomax
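
A possible Python translation of the EDirect pipeline above, offered as a rough sketch rather than a tested equivalent: it uses `Entrez.elink` to follow taxonomy-to-protein links and `Entrez.efetch` to download FASTA. The helper names (`linked_ids`, `fasta_headers`, `proteins_for_taxid`) are made up for illustration, the email is a placeholder, and collecting IDs from every returned link set is an assumption about what is wanted.

```python
def linked_ids(linksets):
    """Collect linked UIDs from a parsed elink result, in order, without duplicates."""
    seen = {}
    for linkset in linksets:
        for linksetdb in linkset.get("LinkSetDb", []):
            for link in linksetdb.get("Link", []):
                seen.setdefault(str(link["Id"]), None)
    return list(seen)


def fasta_headers(fasta_text):
    """Equivalent of `grep "^>"`: keep only the FASTA header lines."""
    return [line for line in fasta_text.splitlines() if line.startswith(">")]


def proteins_for_taxid(taxid, email="your.name@example.com"):
    """elink taxonomy -> protein, then efetch as FASTA (needs network)."""
    # Imported lazily; the two helpers above are plain Python.
    from Bio import Entrez

    Entrez.email = email  # placeholder address
    handle = Entrez.elink(dbfrom="taxonomy", db="protein", id=str(taxid))
    ids = linked_ids(Entrez.read(handle))
    handle.close()
    if not ids:
        return ""
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    text = handle.read()
    handle.close()
    return text
```

With network access, `print("\n".join(fasta_headers(proteins_for_taxid(333387))))` should list header lines like the ones shown in the EDirect output.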

Thanks genomax, I'll try writing a Python version based on this line!

written 24 days ago by flogin
24 days ago by user_without_id, Taiwan
user_without_id wrote:

Something like this?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import io
import json

import pandas as pd
from Bio import Entrez

# NCBI asks for a contact address; replace this with your own.
Entrez.email = "al@eg.com"

# Inline copy of the CSV from the question, with a header row added.
csv_text = io.StringIO("""
family,genus,species
Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus
""")

df = pd.read_csv(csv_text, sep=',')

# To load from file, do (check if has column names (header) or not):
# df = pd.read_csv(filename, sep=',', header=None)

print("List of species:", list(df.species))

# Entrez esearch result limit
RETMAX = 33


def get_ids(response) -> list:
    j = json.loads(response.read())
    return list(j['esearchresult']['idlist'])


for species in df.species:
    txids = get_ids(Entrez.esearch(db="Taxonomy", term=species, retmode="json"))
    for txid in txids:
        prids = get_ids(Entrez.esearch(db="Protein", term=F"txid{txid}[Organism:noexp]", retmax=RETMAX, retmode="json"))
        print(F"Species {species} ({txid}), protein IDs: {prids}")
        for prid in prids:
            # print(json.loads(Entrez.esummary(db="Protein", id=prid, retmode="json").read())['result'][prid])
            fasta = Entrez.efetch(db="Protein", id=prid, rettype="fasta", retmode="text").read()
            print(fasta)
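
The loop above caps each search at RETMAX IDs and sends one efetch request per protein. If all proteins are needed, one option is to page through the search results with `retstart` and fetch the sequences in batches. This is only a sketch under assumptions: the helper names are hypothetical, the page and batch sizes are arbitrary, and the email is a placeholder.

```python
def page_starts(total, page_size):
    """Return the retstart offsets needed to cover `total` results."""
    return list(range(0, total, page_size))


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_all_fasta(term, page_size=500, batch_size=200,
                    email="your.name@example.com"):
    """Collect every protein ID matching `term`, then fetch FASTA in batches."""
    # Imported lazily; the two helpers above are plain Python.
    from Bio import Entrez

    Entrez.email = email  # placeholder address

    # First request only asks for the total hit count.
    handle = Entrez.esearch(db="protein", term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in page_starts(total, page_size):
        handle = Entrez.esearch(db="protein", term=term,
                                retstart=start, retmax=page_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()

    parts = []
    for batch in chunks(ids, batch_size):
        handle = Entrez.efetch(db="protein", id=",".join(batch),
                               rettype="fasta", retmode="text")
        parts.append(handle.read())
        handle.close()
    return "".join(parts)
```

Batching keeps the number of requests low, which matters because NCBI rate-limits E-utilities traffic.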

Thanks, it really works. I just made some adjustments to write the protein sequences to a separate file per species. Thanks!

written 24 days ago by flogin
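
The adjustments mentioned in the reply are not shown in the thread. One possible way to write the sequences to a separate file per species is sketched below; `species_filename` and `write_species_fasta` are hypothetical helpers, and the file-naming scheme is an assumption.

```python
import re
from pathlib import Path


def species_filename(species):
    """Turn a species name into a safe FASTA file name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", species).strip("_")
    return f"{slug}.fasta"


def write_species_fasta(species, fasta_text, out_dir="."):
    """Write one species' FASTA text to its own file and return the path."""
    out_path = Path(out_dir) / species_filename(species)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(fasta_text, encoding="utf-8")
    return out_path
```

Inside the species loop, the FASTA text collected for each species could be passed to `write_species_fasta(species, fasta_text)` instead of being printed.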