Question

ncbi entrez protein by taxonomy ID

3.7 years ago

flogin ▴ 280

Hey,

I'm studying Bio.Entrez to retrieve information from NCBI.

I have already written basic scripts that retrieve sequences based on protein or nucleotide IDs, but I'm wondering whether I can retrieve all proteins for a specific taxonomy ID.

So I have a 3 column csv file, like this:

Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus

So far I have written this:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
from Bio import Entrez
import argparse, csv
import xml.etree.ElementTree as ET

parser = argparse.ArgumentParser(description = 'This script reads a csv file and returns protein information by viral family.')
parser.add_argument("-in", "--input", help="CSV file with 3 columns", required=True)
args = parser.parse_args()
input_file = args.input

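# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "your.name@example.com"

# The species name is in the third column of the CSV file.
with open(input_file, 'r') as in_file:
    reader_in_file = csv.reader(in_file, delimiter=',')
    species_lst = []
    for line in reader_in_file:
        species_lst.append(line[2].strip())

for species in species_lst:
    # Search the Taxonomy database by species name and parse the XML reply.
    handle_id_var = Entrez.esearch(db="Taxonomy", term=species, retmode='xml')
    tree = ET.parse(handle_id_var)
    root = tree.getroot()
    for id_list in root.findall('IdList'):
        for id_elem in id_list.findall('Id'):
            tax_id = id_elem.text  # avoid shadowing the built-in id()
            print(tax_id)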

So, at the moment, this script returns the taxonomy ID for each viral species, and I don't know how to use these IDs to retrieve all proteins for each virus.

python entrez ncbi protein • 1.3k views

ADD COMMENT • link updated 3.7 years ago by user_without_id ▴ 150 • written 3.7 years ago by flogin ▴ 280

Using EntrezDirect (translate into Python as needed):

$ esearch -db taxonomy -query "333387 [taxid]" | elink -target protein | efetch -format fasta | grep "^>"
>AAY88865.2 orf1ab polyprotein [Bat SARS coronavirus HKU3-1]
>AAY88875.1 hypothetical protein orf9b [Bat SARS coronavirus HKU3-1]
>AAY88874.1 nucleocapsid phosphoprotein [Bat SARS coronavirus HKU3-1]
>AAY88873.1 hypothetical protein orf8 [Bat SARS coronavirus HKU3-1]
>AAY88872.1 hypothetical protein orf7b [Bat SARS coronavirus HKU3-1]
>AAY88871.1 hypothetical protein orf7a [Bat SARS coronavirus HKU3-1]
>AAY88870.1 hypothetical protein orf6 [Bat SARS coronavirus HKU3-1]
>AAY88869.1 membrane glycoprotein [Bat SARS coronavirus HKU3-1]
>AAY88868.1 small membrane protein [Bat SARS coronavirus HKU3-1]
>AAY88867.1 hypothetical protein orf3a [Bat SARS coronavirus HKU3-1]
>AAY88866.1 spike glycoprotein [Bat SARS coronavirus HKU3-1]
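
The pipeline above could be translated to Biopython roughly as follows. This is a minimal sketch under a few assumptions, not a tested equivalent: the helper names `link_ids` and `proteins_for_taxid` are made up for illustration, the e-mail address is a placeholder, and the parsed `elink` result is assumed to be the usual list of link sets (possibly with more than one link name, hence the de-duplication).

```python
def link_ids(linksets):
    """Collect linked IDs from a parsed elink result, without duplicates."""
    ids = []
    for linkset in linksets:
        for linkset_db in linkset.get("LinkSetDb", []):
            ids.extend(str(link["Id"]) for link in linkset_db["Link"])
    # dict.fromkeys keeps the first occurrence of each ID, in order.
    return list(dict.fromkeys(ids))


def proteins_for_taxid(taxid, email="you@example.org"):
    """Return FASTA text for the proteins linked to one taxonomy ID."""
    # Imported here so link_ids() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder address, replace with your own

    # Roughly: esearch -db taxonomy ... | elink -target protein
    handle = Entrez.elink(dbfrom="taxonomy", db="protein", id=str(taxid))
    linksets = Entrez.read(handle)
    handle.close()

    ids = link_ids(linksets)
    if not ids:
        return ""

    # Roughly: efetch -format fasta
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    fasta = handle.read()
    handle.close()
    return fasta
```

Calling `print(proteins_for_taxid(333387))` should then print records comparable to the ones listed above, assuming network access to NCBI. For taxa with many proteins, the ID list would probably need to be fetched in batches.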

ADD REPLY • link 3.7 years ago by GenoMax 141k

Thanks, GenoMax, I'll test a Python version of this line!

ADD REPLY • link 3.7 years ago by flogin ▴ 280

score 3 · Accepted Answer · 2020-07-21

Something like this?

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import io
import json

import pandas as pd
from Bio import Entrez

# NCBI asks for a contact address; replace with your own.
Entrez.email = "al@eg.com"

# Inline sample data; named csv_data to avoid shadowing the built-in input().
csv_data = io.StringIO("""
family,genus,species
Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus
""")

df = pd.read_csv(csv_data, sep=',')

# To load from a file that has no header row (as in the question), supply
# the column names so that df.species exists:
# df = pd.read_csv(filename, sep=',', header=None, names=["family", "genus", "species"])

print("List of species:", list(df.species))

# Maximum number of protein IDs requested per esearch call; raise it to retrieve more
RETMAX = 33


def get_ids(response) -> list:
    j = json.loads(response.read())
    return list(j['esearchresult']['idlist'])


for species in df.species:
    txids = get_ids(Entrez.esearch(db="Taxonomy", term=species, retmode="json"))
    for txid in txids:
        prids = get_ids(Entrez.esearch(db="Protein", term=F"txid{txid}[Organism:noexp]", retmax=RETMAX, retmode="json"))
        print(F"Species {species} ({txid}), protein IDs: {prids}")
        for prid in prids:
            # print(json.loads(Entrez.esummary(db="Protein", id=prid, retmode="json").read())['result'][prid])
            fasta = Entrez.efetch(db="Protein", id=prid, rettype="fasta", retmode="text").read()
            print(fasta)
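
The loop above calls `efetch` once per protein ID. Since `efetch` accepts a comma-separated ID list, the requests could be grouped into batches instead. A minimal sketch, assuming the hypothetical helpers `chunks` and `fetch_fasta`, a placeholder e-mail address, and an arbitrary batch size of 100:

```python
def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(protein_ids, batch_size=100, email="you@example.org"):
    """Fetch FASTA records for many protein IDs, one request per batch."""
    # Imported here so chunks() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder address, replace with your own

    parts = []
    for batch in chunks(list(protein_ids), batch_size):
        # One efetch call retrieves every record in the batch.
        handle = Entrez.efetch(db="protein", id=",".join(batch),
                               rettype="fasta", retmode="text")
        parts.append(handle.read())
        handle.close()
    return "".join(parts)
```

In the script above, `print(fetch_fasta(prids))` would replace the inner `for prid in prids:` loop, which cuts the number of requests sent to NCBI.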