Question: Parse the XML response from Entrez db=bioproject using Biopython efetch
3.7 years ago by
Prasad 10
United States
Prasad 10 wrote:

I want to parse the XML response returned by the bioproject database, using efetch from Biopython's Bio.Entrez module.

Here is my code:

from Bio import Entrez
Entrez.email = "myemail@company.org"
handle = Entrez.efetch(db="bioproject", id="55465", rettype='gb', retmode="xml")
records = Entrez.parse(handle)
for record in records:
    print(record)

but this gives the following error:

Bio.Entrez.Parser.ValidationError: Failed to find tag 'RecordSet' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.

If I instead read the handle directly, it works, but it only prints the raw XML lines as they are (no parsing):

handle = Entrez.efetch(db="bioproject", id="55465", rettype='gb', retmode="xml")
readlines = handle.readlines()
for line in readlines:
    print(line)

Can anyone please advise on the right way to parse the XML response in this case?

parsing biopython entrez efetch xml • 2.9k views
modified 3.7 years ago by beegrackle 90 • written 3.7 years ago by Prasad 10

Why don't you try what Bio.Entrez.Parser suggested? "please call Bio.Entrez.read or Bio.Entrez.parse with validate=False."

written 3.7 years ago by Pierre Lindenbaum 117k

Thanks for replying. I did try passing validate=False to parse, but nothing gets printed. Here is what I read about that option:

If validate is True (default), the parser will validate the XML file against the DTD, and raise an error if the XML file contains tags that are not represented in the DTD. If validate is False, the parser will simply skip such tags.

So it looks like, when validate is set to False, those tags are skipped and no output shows up. Maybe the XML response is not well-formed in this case.

modified 3.7 years ago • written 3.7 years ago by Prasad 10

3.7 years ago by
David W 4.7k
New Zealand
David W 4.7k wrote:

This seems to be a problem with the Bioproject database not having a DTD:

http://lists.open-bio.org/pipermail/biopython/2015-May/015632.html

That thread has a workaround, but you might also want to try the development version of Biopython (https://github.com/biopython/biopython) to see whether the new work Peter mentions will "just work".

EDIT

I'm not sure it will make any difference, but 'gb' is not a rettype available for the BioProject database:

http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

modified 3.7 years ago • written 3.7 years ago by David W 4.7k

Thanks for replying, and for pointing out the problem with the rettype parameter. I will try the development version of Biopython to see if it works.

modified 3.7 years ago • written 3.7 years ago by Prasad 10

3.7 years ago by
beegrackle 90
United States
beegrackle 90 wrote:

Hi Prasad:

I was working on the same problem two weeks ago. I finally gave up and used Beautiful Soup to parse the XML results: I studied the XML to find the parent nodes of the child nodes I wanted, then checked that the full path to each item exists before putting it in a dictionary. (Depending on how much of the bioproject entry someone filled out, some levels are missing, and you will get an error if you try to access them.) It's not pretty, but I had to stop looking for something nicer and just get it done. Hope this helps.

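    from Bio import Entrez
    from bs4 import BeautifulSoup as BS  # the 'xml' parser needs lxml installed

    Entrez.email = "myemail@company.org"
    # i is assumed to hold the BioProject ID being fetched, e.g. "55465"
    handle = Entrez.efetch(db="bioproject", retmode="xml", id=i)
    bio_file = handle.read()
    handle.close()
    soup = BS(bio_file, 'xml')
    D = {}
    intro = soup.RecordSet.DocumentSummary.Project
    # A missing level comes back as None, so check each one before descending
    if intro.ProjectDescr and intro.ProjectDescr.Name:
        D['Name'] = intro.ProjectDescr.Name.get_text()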

written 3.7 years ago by beegrackle 90

Thanks, beegrackle, for the suggestion.

I also started parsing the XML myself, but I used xml.etree. BeautifulSoup looks really good. I really appreciate your help.

written 3.7 years ago by Prasad 10
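
For readers who want to try the xml.etree route mentioned above, here is a minimal sketch using only the standard library. It is not the code anyone in this thread actually ran. The sample XML is a hypothetical, heavily trimmed stand-in for a bioproject efetch response that reuses the element names from the Beautiful Soup answer (RecordSet, DocumentSummary, Project, ProjectDescr, Name). The uid attribute, the Description field, and the names SAMPLE_XML, FIELDS and summarize are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for an efetch response from db="bioproject".
# Only element names mentioned in this thread are used; the uid attribute is
# an assumption.
SAMPLE_XML = """<RecordSet>
  <DocumentSummary uid="55465">
    <Project>
      <ProjectDescr>
        <Name>Example project</Name>
      </ProjectDescr>
    </Project>
  </DocumentSummary>
</RecordSet>"""

# Output key -> path relative to each <DocumentSummary>.
# "Description" is an assumed optional field, left out of the sample on
# purpose to show how a missing level is handled.
FIELDS = {
    "Name": "Project/ProjectDescr/Name",
    "Description": "Project/ProjectDescr/Description",
}


def summarize(xml_text):
    """Return one dict per DocumentSummary, skipping fields that are missing."""
    root = ET.fromstring(xml_text)
    records = []
    for doc in root.iter("DocumentSummary"):
        record = {"uid": doc.get("uid")}
        for key, path in FIELDS.items():
            # findtext returns None when any level of the path is absent,
            # so no exception is raised for partially filled entries.
            text = doc.findtext(path)
            if text is not None:
                record[key] = text.strip()
        records.append(record)
    return records


records = summarize(SAMPLE_XML)
print(records)  # [{'uid': '55465', 'Name': 'Example project'}]
```

With a live response, the string returned by handle.read() could be passed to summarize in place of SAMPLE_XML, provided the real element paths match the ones assumed here. Because findtext returns None for an incomplete path, the "check that the full path exists" step becomes a single None test per field.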