How to parse a PubMed abstract that contains HTML tags or Unicode in Python?
3.5 years ago

I'm working on a data mining project that reads PubMed article XML files in Python and parses their data into a database. The problem is that inline tags such as <sup></sup> cause the text to be read incompletely, for example in the abstract of a paper. The code I use to read the XML file, and the XML file I'm trying to read, are posted here.

python xml pubmed parsing html • 2.1k views

It seems you forgot to include some information:

  • What does your XML file look like?
  • What did you try in Python?
  • What do you want to achieve?
  • Please submit an example that doesn't work as expected.

Please provide a reproducible example of "xml files" and your code.


The tag interrupts text reading in Python: for example, it stops reading "Categorical variables were compared using the χ2test." at χ and doesn't print the rest of the text.

    <Abstract>
         <AbstractText> The disease free survival (DFS) and overall survival (OS) were calculated by 
         the Kaplan-Meier method. Categorical variables were compared using the 
         χ<sup>2</sup>test.</AbstractText>
    </Abstract>
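
A likely cause of this symptom, assuming the file is read with xml.etree.ElementTree (the original code was not posted, so this is a guess): an element's .text attribute only holds the text up to its first child tag, so everything from <sup> onward is dropped. A minimal sketch of the difference, using a shortened copy of the abstract above:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above; in practice this comes from the PubMed XML file.
xml_snippet = """<Abstract>
  <AbstractText>Categorical variables were compared using the χ<sup>2</sup>test.</AbstractText>
</Abstract>"""

root = ET.fromstring(xml_snippet)
abstract = root.find("AbstractText")

# .text only holds the text before the first child element (<sup> here).
truncated = abstract.text

# itertext() walks the element and its nested tags, yielding every text piece in order.
full_text = "".join(abstract.itertext())

print(truncated)
print(full_text)
```

With itertext() the inline tags themselves are discarded and only their text is kept, which is usually what is wanted when storing an abstract in a database.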

Please provide a reproducible example of "xml files" and your code. Please edit your original post to make it coherent.


PubMed abstracts sometimes contain rudimentary HTML markup. That shouldn't cause problems, but without the code you are using it is hard to say. I have edited the title to better reflect what this is about.


It looks like Python struggles with the "χ" of your χ2test (chi-squared test). This is a special character, so you need to take it into account. Would you please share a link to your XML file and your Python code?


Does Python have problems with Unicode?

3.5 years ago

Try this in your Python code:

    # Read the file line by line, decoding it as ISO-8859-7 (Greek).
    with open("greek.xml", "r", encoding="ISO-8859-7") as f:
        for line in f:
            print(line)

This assumes that your χ (lowercase chi) is encoded as ISO-8859-7, which seems to include it: https://en.wikipedia.org/wiki/ISO/IEC_8859-7
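
If the file is not actually ISO-8859-7, forcing that encoding will garble the text. An alternative sketch, assuming the XML is UTF-8 (the XML default when no other encoding is declared): hand the raw bytes to the parser and let it decode them according to the XML declaration. The sample bytes below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: UTF-8 bytes with a matching XML declaration.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<AbstractText>χ<sup>2</sup> test</AbstractText>"
).encode("utf-8")

# Given bytes, the parser decodes using the encoding named in the declaration,
# so no manual codec choice is needed.
elem = ET.fromstring(raw)
text = "".join(elem.itertext())

print(text)
```

For a file on disk, ET.parse("some_file.xml") behaves the same way, since it reads the file as bytes; the file name here is only a placeholder.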
