accessing ncbi nucleotide section with python

4.4 years ago

flogin ▴ 280

I'm working on a small script that accesses the NCBI Nucleotide database with a list of IDs and retrieves the host information for each ID.

So I wrote this:

import requests as req
link_nucleotide = 'https://www.ncbi.nlm.nih.gov/nuccore/'

lst_terms = ['MG873553.1','MG873552.1','MG873551.1','MG873550.1','MG251660.1','MG251659.1','MG251658.1','MG251657.1','KX650071.1','KX650070.1']
for i in lst_terms:
    link_id = link_nucleotide+i
    response = req.get(link_id)
    my_file = response.text
    print(my_file)

But when I read the output, there is no field called "/host=", even though the page at https://www.ncbi.nlm.nih.gov/nuccore/MG873553.1 shows one (/host="Elymana sulphurella" for the first ID).

Is there another way to access the text of this URL to retrieve the host information?

Best.

python ncbi ids match • 1.1k views


If you need to use Python, I suggest using BioPython, specifically the Entrez module. The documentation and the numerous posts on Biostars should get you up and running.

If Python is not a requirement, you should check out Entrez Direct to download NCBI data from the command line.
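
For illustration, here is a minimal sketch of the same idea using only the standard library and the E-utilities efetch endpoint (the service that the Entrez module wraps). The nuccore web page most likely fills in the record with JavaScript, which would explain why the raw HTML has no "/host=" text; efetch returns the GenBank flat file directly. The function names fetch_genbank and extract_host are hypothetical, and the sample text is a hand-written excerpt in the flat-file layout, not a downloaded record.

```python
import re
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_genbank(accession):
    """Download one nuccore record as GenBank flat-file text (needs network access)."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}") as response:
        return response.read().decode("utf-8")


def extract_host(genbank_text):
    """Return the value of the first /host="..." qualifier, or None if absent."""
    match = re.search(r'/host="([^"]*)"', genbank_text)
    if match is None:
        return None
    # Qualifier values can wrap across lines, so collapse runs of whitespace.
    return " ".join(match.group(1).split())


# Hand-written excerpt (an assumption for illustration), used to show the parsing step.
sample = '''     source          1..500
                     /organism="Example virus"
                     /host="Elymana sulphurella"
'''
print(extract_host(sample))
```

With network access, extract_host(fetch_genbank('MG873553.1')) would apply the same parsing to the real record. For a long ID list, keep the request rate within NCBI's usage guidelines.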

ADD REPLY • link 4.4 years ago by vkkodali_ncbi ★ 3.7k

Thanks vkkodali, I'm using this module now and everything is working well, but I have just one silly question:

I'm using this code:

from Bio import Entrez

handle = Entrez.efetch(db="Taxonomy", id='MG873553.1', retmode="xml")
record = Entrez.read(handle)
print(record[0]['LineageEx'])

That returns:

[DictElement({'TaxId': '131567', 'ScientificName': 'cellular organisms', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '2759', 'ScientificName': 'Eukaryota', 'Rank': 'superkingdom'}, attributes={}), DictElement({'TaxId': '33154', 'ScientificName': 'Opisthokonta', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '33208', 'ScientificName': 'Metazoa', 'Rank': 'kingdom'}, attributes={}), DictElement({'TaxId': '6072', 'ScientificName': 'Eumetazoa', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '33213', 'ScientificName': 'Bilateria', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '33317', 'ScientificName': 'Protostomia', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '1206794', 'ScientificName': 'Ecdysozoa', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '88770', 'ScientificName': 'Panarthropoda', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '6656', 'ScientificName': 'Arthropoda', 'Rank': 'phylum'}, attributes={}), DictElement({'TaxId': '197563', 'ScientificName': 'Mandibulata', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '197562', 'ScientificName': 'Pancrustacea', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '6960', 'ScientificName': 'Hexapoda', 'Rank': 'subphylum'}, attributes={}), DictElement({'TaxId': '50557', 'ScientificName': 'Insecta', 'Rank': 'class'}, attributes={}), DictElement({'TaxId': '85512', 'ScientificName': 'Dicondylia', 'Rank': 'no rank'}, attributes={}), DictElement({'TaxId': '7496', 'ScientificName': 'Pterygota', 'Rank': 'subclass'}, attributes={}), DictElement({'TaxId': '33340', 'ScientificName': 'Neoptera', 'Rank': 'infraclass'}, attributes={}), DictElement({'TaxId': '33342', 'ScientificName': 'Paraneoptera', 'Rank': 'cohort'}, attributes={}), DictElement({'TaxId': '7524', 'ScientificName': 'Hemiptera', 'Rank': 'order'}, attributes={}), DictElement({'TaxId': '1955247', 'ScientificName': 'Auchenorrhyncha', 'Rank': 'suborder'}, attributes={}), 
DictElement({'TaxId': '33365', 'ScientificName': 'Cicadomorpha', 'Rank': 'infraorder'}, attributes={}), DictElement({'TaxId': '33368', 'ScientificName': 'Membracoidea', 'Rank': 'superfamily'}, attributes={}), DictElement({'TaxId': '30102', 'ScientificName': 'Cicadellidae', 'Rank': 'family'}, attributes={}), DictElement({'TaxId': '33372', 'ScientificName': 'Deltocephalinae', 'Rank': 'subfamily'}, attributes={}), DictElement({'TaxId': '706723', 'ScientificName': 'Elymana', 'Rank': 'genus'}, attributes={})]

How can I read this output to access specific ranks? For example: phylum = Arthropoda, order = Hemiptera.

Thanks

ADD REPLY • link 4.4 years ago by flogin ▴ 280

I'd do something like this:

for i in record[0]['LineageEx']:
  if 'phylum' in i.values() or 'order' in i.values():
    print(i)

ADD REPLY • link 4.4 years ago by vkkodali_ncbi ★ 3.7k

Thanks vkkodali, I adapted it like this to retrieve the taxonomic levels that I want:

for x in record_host_tax[0]["LineageEx"]:
    if 'phylum' in x.values():
        phylum = x['ScientificName']
    if 'class' in x.values():
        classe = x['ScientificName']
    if 'order' in x.values():
        order = x['ScientificName']
    if 'family' in x.values():
        family = x['ScientificName']

ADD REPLY • link 4.4 years ago by flogin ▴ 280

This is perfectly fine but does not scale well if you have a whole bunch of ranks for which you need to collect data. Here's an alternative:

blessed_ranks = {'phylum', 'class', 'order', 'family'}
rank_dict = {}

for x in record[0]['LineageEx']:
    # Compare the 'Rank' field itself so a name that happens to equal a rank cannot match.
    if x['Rank'] in blessed_ranks:
        rank_dict[x['Rank']] = x['ScientificName']

print(rank_dict)
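
If you run this over many records, note that rank_dict simply has no key for a rank that is missing from a lineage. A possible variation, sketched below, returns None for missing ranks so every record yields the same keys. The helper name ranks_of_interest is hypothetical, and the plain dicts stand in for the DictElement entries of record[0]['LineageEx'], which support the same key lookups.

```python
def ranks_of_interest(lineage, wanted=("phylum", "class", "order", "family")):
    """Map each wanted rank to its scientific name, or None when the lineage lacks it."""
    found = {entry["Rank"]: entry["ScientificName"] for entry in lineage}
    return {rank: found.get(rank) for rank in wanted}


# Plain dicts standing in for a few entries of record[0]['LineageEx'].
lineage = [
    {"TaxId": "6656", "ScientificName": "Arthropoda", "Rank": "phylum"},
    {"TaxId": "50557", "ScientificName": "Insecta", "Rank": "class"},
    {"TaxId": "7524", "ScientificName": "Hemiptera", "Rank": "order"},
    {"TaxId": "706723", "ScientificName": "Elymana", "Rank": "genus"},
]
print(ranks_of_interest(lineage))
```

Here the family entry is deliberately left out of the sample lineage, so its value comes back as None instead of raising a KeyError later.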

ADD REPLY • link 4.4 years ago by vkkodali_ncbi ★ 3.7k