Parse the XML response from Entrez db=bioproject using Biopython efetch
5.9 years ago

I want to parse the XML response obtained from the BioProject database using the efetch function in Biopython.

Here is my code:

from Bio import Entrez
Entrez.email = "myemail@company.org"
handle = Entrez.efetch(db="bioproject", id="55465", rettype='gb', retmode="xml")
records = Entrez.parse(handle)
for record in records:
    print(record)

but this gives the following error:

Bio.Entrez.Parser.ValidationError: Failed to find tag 'RecordSet' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.

If I try this instead, it works, but it just prints the raw XML lines as they are (no parsing):

handle = Entrez.efetch(db="bioproject", id="55465", rettype='gb', retmode="xml")
for line in handle:
    print(line)

Can anyone please advise on the right way to parse the XML response in this case?

xml biopython entrez efetch parsing • 4.3k views

Why don't you try what Bio.Entrez.Parser said? "please call Bio.Entrez.read or Bio.Entrez.parse with validate=False."


Thanks for replying. I did try passing validate=False to parse, but nothing gets printed. I was reading up on what validate does:

If validate is True (default), the parser will validate the XML file against the DTD, and raise an error if the XML file contains tags that are not represented in the DTD. If validate is False, the parser will simply skip such tags.

So looks like when validate is set to False, those tags are getting skipped and no output shows up. Maybe the XML response is not well-formed in this case.

5.9 years ago
David W 4.8k

This seems to be a problem with the Bioproject database not having a DTD:

http://lists.open-bio.org/pipermail/biopython/2015-May/015632.html

That thread has a workaround, but you might also want to try the development version of Biopython (https://github.com/biopython/biopython) to see if the new work Peter mentions will "just work".

EDIT

I'm not sure it will make any difference, but 'gb' is not a rettype available for the BioProject database:

http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly


Thanks for replying, and for pointing out the rettype parameter. I will try the development version of Biopython to see if it works.

5.9 years ago
beegrackle ▴ 90

I was working on the same problem two weeks ago. I finally gave up and used Beautiful Soup to parse the XML results. That meant going through the XML files, studying the parent nodes of the child nodes I wanted, and checking that the full path to each item exists before putting it in a dictionary. Depending on how much of the BioProject entry someone filled out, some levels are missing, and you will get an error if you try to access them. It's not pretty, but I had to stop looking for something nicer and just get it done. Hope this helps.

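from Bio import Entrez
from bs4 import BeautifulSoup as BS  # the 'xml' parser needs lxml installed

Entrez.email = "myemail@company.org"
i = "55465"  # BioProject ID, taken from the question
handle = Entrez.efetch(db="bioproject", retmode="xml", id=i)
soup = BS(handle, 'xml')
handle.close()
D = {}
# Check each level before descending; a missing child tag comes back as None
intro = soup.RecordSet.DocumentSummary.Project
if intro.ProjectDescr is not None and intro.ProjectDescr.Name is not None:
    D['Name'] = intro.ProjectDescr.Name.get_text()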


Thanks, beegrackle, for the suggestion.

I also started parsing the XML myself, but I used xml.etree. BeautifulSoup looks really good. Really appreciate your help.
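
In case it helps anyone landing here later, this is roughly what the xml.etree route can look like. It is only a sketch: the sample XML is a made-up stand-in that follows the RecordSet > DocumentSummary > Project > ProjectDescr > Name path from the BeautifulSoup answer, not real NCBI output, and the FIELDS table, the "Title" entry and the summarize helper are names invented for illustration. With a real response you would pass in the text read from the efetch handle instead of SAMPLE_XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for an efetch response (not real NCBI output).
SAMPLE_XML = """<RecordSet>
  <DocumentSummary>
    <Project>
      <ProjectDescr>
        <Name>Example project</Name>
      </ProjectDescr>
    </Project>
  </DocumentSummary>
  <DocumentSummary>
    <Project>
      <ProjectDescr/>
    </Project>
  </DocumentSummary>
</RecordSet>"""

# Paths relative to each DocumentSummary; keys and paths are illustrative assumptions.
FIELDS = {
    "Name": "Project/ProjectDescr/Name",
    "Title": "Project/ProjectDescr/Title",
}


def summarize(xml_text):
    """Return one dict per DocumentSummary, leaving out fields that are absent."""
    root = ET.fromstring(xml_text)
    summaries = []
    for doc in root.iter("DocumentSummary"):
        record = {}
        for key, path in FIELDS.items():
            # findtext returns None when any level of the path is missing,
            # so no exception is raised for sparsely filled entries.
            text = doc.findtext(path)
            if text is not None:
                record[key] = text.strip()
        summaries.append(record)
    return summaries


results = summarize(SAMPLE_XML)
print(results)
```

Using findtext with a path avoids checking each level by hand: a missing node anywhere along the way simply yields None, so the entry is skipped rather than raising an error.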