ncbi entrez protein by taxonomy ID
3.7 years ago
flogin ▴ 280

Hey,

I'm learning Bio.Entrez to retrieve information from NCBI.

I have already written basic scripts that retrieve sequences from protein or nucleotide IDs, but I'm wondering whether I can retrieve all proteins for a specific taxonomy ID.

I have a three-column CSV file (family, genus, species), like this:

Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus

This is what I have written so far:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
from Bio import Entrez
import argparse, csv
import xml.etree.ElementTree as ET

parser = argparse.ArgumentParser(description = 'This script reads a csv file and returns protein information by viral family.')
parser.add_argument("-in", "--input", help="CSV file with 3 columns", required=True)
args = parser.parse_args()
input_file = args.input

with open(input_file,'r') as in_file:
    reader_in_file = csv.reader(in_file,delimiter=',')
    viral_family_lst = []
    for line in reader_in_file:
        viral_family = line[2].rstrip('\n')
        viral_family_lst.append(viral_family)

for viral_family in viral_family_lst:
    handle_id_var = Entrez.esearch(db="Taxonomy", term=viral_family,retmode='xml')
    tree = ET.parse(handle_id_var)
    root = tree.getroot()
    for app in root.findall('IdList'):
        for l in app.findall('Id'):
            id = l.text
            print(id)

So far, this script prints the taxonomy ID for each viral species, but I don't know how to use these IDs to retrieve all proteins for each virus.

python entrez ncbi protein • 1.3k views

Using Entrez Direct (EDirect). Translate into Python as needed:

$ esearch -db taxonomy -query "333387 [taxid]" | elink -target protein | efetch -format fasta | grep "^>"
>AAY88865.2 orf1ab polyprotein [Bat SARS coronavirus HKU3-1]
>AAY88875.1 hypothetical protein orf9b [Bat SARS coronavirus HKU3-1]
>AAY88874.1 nucleocapsid phosphoprotein [Bat SARS coronavirus HKU3-1]
>AAY88873.1 hypothetical protein orf8 [Bat SARS coronavirus HKU3-1]
>AAY88872.1 hypothetical protein orf7b [Bat SARS coronavirus HKU3-1]
>AAY88871.1 hypothetical protein orf7a [Bat SARS coronavirus HKU3-1]
>AAY88870.1 hypothetical protein orf6 [Bat SARS coronavirus HKU3-1]
>AAY88869.1 membrane glycoprotein [Bat SARS coronavirus HKU3-1]
>AAY88868.1 small membrane protein [Bat SARS coronavirus HKU3-1]
>AAY88867.1 hypothetical protein orf3a [Bat SARS coronavirus HKU3-1]
>AAY88866.1 spike glycoprotein [Bat SARS coronavirus HKU3-1]
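
For reference, a rough Python equivalent of the pipeline above might look like the sketch below. It assumes Biopython is installed and uses `Entrez.elink` to follow the taxonomy-to-protein links, then `Entrez.efetch` to download the FASTA records. The function names `linked_protein_ids` and `fetch_proteins_by_taxid` and the `email` parameter are made up for this example, and it has not been checked to return exactly the same records as the EDirect command.

```python
def linked_protein_ids(elink_result):
    """Collect the linked UIDs from a parsed Entrez.elink result."""
    ids = []
    for linkset in elink_result:
        # Each link set may hold zero or more LinkSetDb blocks.
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb["Link"])
    return ids


def fetch_proteins_by_taxid(txid, email):
    """Return FASTA text for proteins linked to one taxonomy ID (needs network)."""
    # Imported here so the helper above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # Rough counterpart of: esearch -db taxonomy ... | elink -target protein
    handle = Entrez.elink(dbfrom="taxonomy", db="protein", id=str(txid))
    result = Entrez.read(handle)
    handle.close()

    ids = linked_protein_ids(result)
    if not ids:
        return ""

    # Rough counterpart of: efetch -format fasta
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    fasta = handle.read()
    handle.close()
    return fasta
```

Calling `fetch_proteins_by_taxid(333387, "you@example.org")` with your own address should return the FASTA text as one string. For taxa with a very large number of proteins, fetching in smaller batches is probably safer than one big request.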

Thanks genomax, I'll try writing a Python version of this command line!

3.7 years ago

Something like this?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import io
import json

import pandas as pd
from Bio import Entrez

# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "al@eg.com"

# Inline copy of the example CSV, with a header row added.
csv_text = io.StringIO("""
family,genus,species
Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus
""")

df = pd.read_csv(csv_text, sep=',')

# To load from file, do (check if has column names (header) or not):
# df = pd.read_csv(filename, sep=',', header=None)

print("List of species:", list(df.species))

# Entrez esearch result limit
RETMAX = 33


def get_ids(response) -> list:
    j = json.loads(response.read())
    return list(j['esearchresult']['idlist'])


for species in df.species:
    txids = get_ids(Entrez.esearch(db="Taxonomy", term=species, retmode="json"))
    for txid in txids:
        prids = get_ids(Entrez.esearch(db="Protein", term=F"txid{txid}[Organism:noexp]", retmax=RETMAX, retmode="json"))
        print(F"Species {species} ({txid}), protein IDs: {prids}")
        for prid in prids:
            # print(json.loads(Entrez.esummary(db="Protein", id=prid, retmode="json").read())['result'][prid])
            fasta = Entrez.efetch(db="Protein", id=prid, rettype="fasta", retmode="text").read()
            print(fasta)
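
One possible refinement: the loop above sends one `efetch` request per protein ID, but `efetch` also accepts a comma-separated list of IDs, so the records can be fetched in batches. Below is a minimal sketch of that idea, assuming `Entrez.email` has already been set as in the script. The names `chunked` and `fetch_fasta_batched` and the batch size of 200 are my own choices, not anything NCBI prescribes.

```python
def chunked(ids, size=200):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_fasta_batched(prids, batch_size=200):
    """Fetch FASTA text for many protein IDs with one request per batch."""
    # Imported here so `chunked` can be used without Biopython.
    from Bio import Entrez

    parts = []
    for batch in chunked(prids, batch_size):
        handle = Entrez.efetch(db="Protein", id=",".join(batch),
                               rettype="fasta", retmode="text")
        parts.append(handle.read())
        handle.close()
    return "".join(parts)
```

In the script, `fetch_fasta_batched(prids)` would then replace the inner `for prid in prids:` loop and return all sequences for a species as one string.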

Thanks, it really works. I just made some adjustments to write the protein sequences to a separate file per species. Thanks!
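
The adjusted script was not posted, so the following is only a guess at what writing one file per species could look like. The helper names `species_filename` and `write_species_fasta`, the `.fasta` extension, and the rule of replacing non-alphanumeric characters with underscores are all assumptions for illustration.

```python
import re
from pathlib import Path


def species_filename(species):
    """Turn a species name into a file name that is safe on most systems."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", species).strip("_")
    return f"{stem}.fasta"


def write_species_fasta(species, fasta_records, out_dir="."):
    """Write FASTA strings for one species to <out_dir>/<species>.fasta."""
    out_path = Path(out_dir) / species_filename(species)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as out_file:
        for record in fasta_records:
            # Make sure every record ends with exactly one trailing newline.
            out_file.write(record if record.endswith("\n") else record + "\n")
    return out_path
```

In the answer's loop, the FASTA strings returned by `efetch` could be collected in a list per species and passed to `write_species_fasta(species, records)` instead of being printed.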
