Question: PMC XML Parsing
Alex (Theodosius Dobzhansky Center for Genome Bioinformatics) wrote 9.2 years ago:

I want to extract article text from PMC XML data with Python.

I used event-driven parsing, but it doesn't work well with XML like this:

<p>There is a limimited ... Journal, 
<ext-link ext-link-type="uri" xlink:href="http://...index.asp">
http://...index.asp</ext-link>). Published ... </p>

I think the problem is with tag-text-other_tag-text-tag sequences (mixed content). When I use event-driven parsing, I get something like this:

Code example:

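# Assumption: the question does not say which etree is used;
# lxml.etree exposes the same iterparse call.
import xml.etree.ElementTree as etree

n = 0
result = ""
# fh: an open file object (or a path) for the PMC XML document
for event, element in etree.iterparse(fh, events=['start', 'end']):
    if event == "start":
        n += 2
        result += "%s<%s>\n" % (" " * n, element.tag)
        # Only element.text (the text before the first child) is read here, and it
        # may not be populated yet at "start"; element.tail is never emitted.
        result += "%s%s\n" % (" " * (n + 2), element.text)
    elif event == "end":
        result += "%s</%s>\n" % (" " * n, element.tag)
        n -= 2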

Output of this code:

      

There is a limimited ... Journal, <ext-link> <http://www.aapsj.org/theme_issues/virtual/index.asp> </ext-link> <xref> 1 </xref> <xref> 2 </xref> <xref> 3 </xref> <xref> 4 </xref>

I suspect there is a simple solution for this case. How do you handle XML like this with Python?

Are there any tools/libraries for converting PMC XML to raw text?

Tags: xml, pubmed, parsing • 4.1k views
modified 8.4 years ago by Chris Maloney • written 9.2 years ago by Alex

Show us the code, please :-)

written 9.2 years ago by Pierre Lindenbaum

Code example added =)

written 9.2 years ago by Alex

I don't know your Python parser, but I would guess that it is only data-oriented (it expects only text data between two tags, not text mixed with child elements).

written 9.2 years ago by Pierre Lindenbaum
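
To illustrate the mixed-content guess above: in ElementTree-style parsers, text that follows a child's closing tag is stored on that child's `tail`, not on the parent's `text`. A minimal sketch with the standard library, assuming a made-up paragraph and a placeholder URL (neither comes from the PMC file in the question):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the snippet in the question;
# the wording and the example.org URL are placeholders.
SAMPLE = (
    '<p xmlns:xlink="http://www.w3.org/1999/xlink">See the journal site ('
    '<ext-link ext-link-type="uri" xlink:href="http://example.org/index.asp">'
    'http://example.org/index.asp</ext-link>). Published online.</p>'
)

p = ET.fromstring(SAMPLE)
link = p[0]  # the <ext-link> child

# Text before the first child lives on the parent ...
print(repr(p.text))
# ... while text after the child's closing tag lives on the child's .tail.
print(repr(link.tail))

# itertext() visits text and tail in document order, so joining it
# gives the paragraph as a reader would see it.
paragraph_text = "".join(p.itertext())
print(paragraph_text)
```

Reading only `element.text`, as the loop in the question does, would drop the `). Published online.` part of this sample.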

Your googleable code (http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/pubmed2exhibit.xsl) was a helpful hint in the right direction =) Now I know that any SAX parser is a bad idea in my case; it is better to use XSLT.

written 9.2 years ago by Alex
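
For comparison with the XSLT route, event-driven parsing can still recover whole paragraphs if the text is collected on `end` events, when the subtree is complete. A minimal sketch with the standard library, assuming a made-up two-paragraph article and a hypothetical helper named `iter_paragraphs`; it is not the XSLT approach from the linked stylesheet:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature article; real PMC files are much larger.
SAMPLE = b"""<article xmlns:xlink="http://www.w3.org/1999/xlink"><body>
<p>First <italic>mixed</italic> paragraph <xref ref-type="bibr">1</xref>.</p>
<p>Second paragraph with a link:
<ext-link ext-link-type="uri" xlink:href="http://example.org/">http://example.org/</ext-link>.</p>
</body></article>"""


def iter_paragraphs(source):
    """Yield the whitespace-normalised text of each <p> while streaming."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "p":
            # At the "end" event the whole subtree has been parsed, so
            # itertext() sees the text and tail of every nested element.
            yield " ".join("".join(elem.itertext()).split())
            elem.clear()  # release content of paragraphs already handled


paragraphs = list(iter_paragraphs(io.BytesIO(SAMPLE)))
for text in paragraphs:
    print(text)
```

This assumes `<p>` elements are not nested; with nesting, `elem.clear()` would discard an inner paragraph's text before the outer one is joined.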

This might help.

modified 10 months ago by RamRS • written 9.2 years ago by Michael Schubert
Luwening wrote 9.2 years ago:

See the PMC tagging guidelines at http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html and the DTD at http://dtd.nlm.nih.gov/publishing/. Biopython may also help you parse the XML file (it depends on the DTD as well).

modified 10 months ago by RamRS • written 9.2 years ago by Luwening
Chris Maloney (Bethesda, MD) wrote 8.8 years ago:

This question seems to be specifically about parsing XML with Python, and not about bioinformatics or about the NLM DTDs (JATS). I am not a Python user, but I'd suggest asking on the Python mailing list, or Googling "parsing xml with python".

written 8.8 years ago by Chris Maloney