Question: How to parse a pubmed abstract that contains html tags or Unicode in python?
aqsaawan.459 (0) wrote 20 months ago:

I'm working on a data-mining project that reads PubMed article XML files in Python and parses their data into a database. The problem is that inline tags such as <sup></sup> cause the surrounding text to be read incompletely, for example in the abstract of a paper. The code I use to read the XML file, and the XML file I'm trying to read, are posted here.

parsing xml pubmed python html • 1.0k views
Modified 20 months ago by Pierre Lindenbaum (124k) • written 20 months ago by aqsaawan.459 (0)

It seems you forgot some input information:

  • What does your XML file look like?
  • What did you try in Python?
  • What do you want to achieve?
  • Please submit an example that doesn't work as expected.
Modified 20 months ago • written 20 months ago by Bastien Hervé (4.5k)

Please provide a reproducible example of "xml files" and your code.

Written 20 months ago by Michael Dondrup (47k)

The tag interrupts text reading in Python. For example, reading "Categorical variables were compared using the χ2test." stops at χ and no further text is printed:

    <Abstract>
         <AbstractText> The disease free survival (DFS) and overall survival (OS) were calculated by 
         the Kaplan-Meier method. Categorical variables were compared using the 
         χ<sup>2</sup>test.</AbstractText>
    </Abstract>
Modified 20 months ago by Michael Dondrup (47k) • written 20 months ago by aqsaawan.459 (0)
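
The original parsing code was never posted, so the following is only a minimal sketch of one likely cause, assuming the file is read with the standard-library `xml.etree.ElementTree`. In that library, an element's `.text` attribute holds only the text that comes before its first child element, so it ends right after "χ" where `<sup>` begins, while `itertext()` collects the text of the element and all of its inline children. The `FRAGMENT` constant is the snippet quoted above, collapsed onto one line for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment quoted above, collapsed onto one line for brevity.
FRAGMENT = (
    "<Abstract><AbstractText>Categorical variables were compared using the "
    "χ<sup>2</sup>test.</AbstractText></Abstract>"
)

node = ET.fromstring(FRAGMENT).find("AbstractText")

# .text holds only the text that precedes the first child element (<sup>).
truncated = node.text

# itertext() walks the element and its descendants, yielding every text piece.
full = "".join(node.itertext())

print(truncated)
print(full)
```

If this is indeed what happens, the truncation comes from how the element tree stores mixed content rather than from the χ character itself.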

Please provide a reproducible example of "xml files" and your code. Please edit your original post to make it coherent.

Modified 20 months ago • written 20 months ago by Michael Dondrup (47k)

PubMed abstracts sometimes contain rudimentary HTML markup. That shouldn't cause problems, but without the code you are using it is hard to say. I have edited the title to better reflect what this is about.

Modified 20 months ago • written 20 months ago by Michael Dondrup (47k)

It looks like Python struggles with the "χ" of your χ2test (chi-squared test). This is a special character, and you need to take it into account. Would you please share a link to your XML file and your Python code?

Written 20 months ago by Bastien Hervé (4.5k)

Does python have problems with unicode?

Written 20 months ago by Michael Dondrup (47k)
Bastien Hervé (4.5k), Limoges, CBRS, France, wrote 20 months ago:

Try this in your Python code:

    import codecs

    # Decode the file as ISO-8859-7 (Greek) while reading it line by line.
    with codecs.open("greek.xml", "r", encoding="ISO-8859-7") as f:
        for line in f:
            print(line)

This assumes that your χ (lowercase chi) is encoded as ISO-8859-7, which seems to be the case: https://en.wikipedia.org/wiki/ISO/IEC_8859-7

Written 20 months ago by Bastien Hervé (4.5k)
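
The encoding of the poster's file is not known, so here is a hedged alternative sketch: instead of choosing a codec by hand, `xml.etree.ElementTree.parse` can read the file as bytes and decode it according to the encoding declared in the XML prolog. The sample document, its UTF-8 declaration, and the temporary file standing in for the real XML file are all assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document; the encoding is declared in the XML prolog.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Abstract><AbstractText>Categorical variables were compared using the "
    "χ<sup>2</sup>test.</AbstractText></Abstract>\n"
)

# Write the sample as UTF-8 bytes to a temporary file standing in for the real one.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(SAMPLE.encode("utf-8"))
    path = tmp.name

try:
    # ET.parse reads the file as bytes and decodes it using the declared encoding.
    root = ET.parse(path).getroot()
finally:
    os.remove(path)

# The text before <sup> ends with a correctly decoded chi (U+03C7).
first_chunk = root.find("AbstractText").text
print(first_chunk.endswith("\u03c7"))
```

Under these assumptions the χ is decoded without any explicit codec, and the text before `<sup>` ends exactly at χ, which matches the symptom described in the question.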
Pierre Lindenbaum (124k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 20 months ago:

How to parse a pubmed abstract that

You can use an XSLT stylesheet to extract the fields you need. See, e.g.:

  • Pubmed Xml To Other Format, Such As Enw
  • What is the simplest way to go from Pubmed ID to citation, programmatically?
  • Importing Pubmed Medline Details Into A Local Rdbms To Execute Data Mining Methods
  • parse nxml file from pubmed
  • Getting Tab-Delimited Pmids And Abstracts From Pubmed
  • https://www.biostars.org/p/14051/
  • ...

Modified 20 months ago • written 20 months ago by Pierre Lindenbaum (124k)
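
The Python standard library has no XSLT processor, so the sketch below is not the stylesheet approach suggested above. It is a comparable field extraction written with `xml.etree.ElementTree`, producing one tuple per article that could then be inserted into a database. The sample record is hypothetical and heavily trimmed, the PMID is a placeholder, and `flat_text` and `extract_records` are illustrative helper names, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record; real PubMed exports carry many more fields.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>An <i>example</i> title</ArticleTitle>
        <Abstract>
          <AbstractText>Compared using the χ<sup>2</sup>test.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def flat_text(element):
    """Return all text inside element, including text within inline tags."""
    return "" if element is None else "".join(element.itertext()).strip()


def extract_records(root):
    """Yield one (pmid, title, abstract) tuple per PubmedArticle element."""
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID", default="")
        title = flat_text(article.find("MedlineCitation/Article/ArticleTitle"))
        # Structured abstracts have several AbstractText parts; join them with spaces.
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        abstract = " ".join(flat_text(part) for part in parts)
        yield pmid, title, abstract


records = list(extract_records(ET.fromstring(SAMPLE)))
print(records)
```

Each tuple keeps the text inside inline tags such as `<sup>` and `<i>`, so the abstract reaches the database complete.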