Question: How to parse a pubmed abstract that contains html tags or Unicode in python?
2.7 years ago, aqsaawan.459 wrote:

I'm working on a data mining project that reads PubMed article XML files in Python and parses their data into a database. The problem is that some inline tags, such as <sup></sup>, prevent the complete text from being read, for example in the abstract of a paper. The code that reads the XML file and the XML file I'm trying to read are posted here.

parsing xml pubmed python html • 1.5k views
modified 2.7 years ago by Pierre Lindenbaum • written 2.7 years ago by aqsaawan.459

It seems you forgot to include some information:

  • What does your XML file look like?
  • What did you try in Python?
  • What do you want to achieve?
  • Please submit an example that doesn't work as expected.
modified 2.7 years ago • written 2.7 years ago by Bastien Hervé

Please provide a reproducible example of "xml files" and your code.

written 2.7 years ago by Michael Dondrup

The tag interrupts text reading in Python. For example, reading "Categorical variables were compared using the χ2test." stops at χ and no further text is printed.

    <Abstract>
         <AbstractText> The disease free survival (DFS) and overall survival (OS) were calculated by 
         the Kaplan-Meier method. Categorical variables were compared using the 
         χ<sup>2</sup>test.</AbstractText>
    </Abstract>
modified 2.7 years ago by Michael Dondrup • written 2.7 years ago by aqsaawan.459
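
The truncation described in the comment above is consistent with how Python's standard `xml.etree.ElementTree` handles mixed content: `Element.text` only holds the characters before the first child element, so reading stops right at `χ`, before `<sup>`. The original code was never posted, so this is only a minimal sketch under the assumption that ElementTree (or a parser with the same `.text` behaviour) was used; the snippet is a shortened copy of the abstract above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the abstract posted above.
xml_snippet = """<Abstract>
  <AbstractText>Categorical variables were compared using the χ<sup>2</sup>test.</AbstractText>
</Abstract>"""

root = ET.fromstring(xml_snippet)
node = root.find("AbstractText")

# .text holds only the characters before the first child element (<sup>).
truncated = node.text

# itertext() walks the element and its descendants in document order,
# so the text inside <sup> and the tail after it are included.
full_text = "".join(node.itertext())

print(truncated)
print(full_text)
```

If this matches the behaviour you see, the `χ` character is probably not the cause; the inline tag is, and joining `itertext()` recovers the whole sentence.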

Please provide a reproducible example of "xml files" and your code. Please edit your original post to make it coherent.

modified 2.7 years ago • written 2.7 years ago by Michael Dondrup

PubMed abstracts sometimes contain rudimentary HTML markup. That shouldn't cause problems, but without the code you are using it is hard to say. I have edited the title to better reflect what this is about.

modified 2.7 years ago • written 2.7 years ago by Michael Dondrup

It looks like Python struggles with the "χ" in your χ2test (chi-squared test). This is a special character, and you need to take it into account. Could you please share a link to your XML file and your Python code?

written 2.7 years ago by Bastien Hervé

Does Python have problems with Unicode?

written 2.7 years ago by Michael Dondrup
2.7 years ago, Bastien Hervé (Karolinska Institutet, Sweden) wrote:

Try this in your Python code:

    import codecs

    # Open the file with an explicit encoding so non-ASCII characters decode correctly.
    with codecs.open("greek.xml", "r", encoding="ISO-8859-7") as f:
        for line in f:
            print(line)

This assumes that your χ (lowercase chi) is encoded as ISO-8859-7, which seems to be the case (see https://en.wikipedia.org/wiki/ISO/IEC_8859-7).
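
Whether ISO-8859-7 is the right choice depends on the bytes actually stored in the file, which the thread does not show. The sketch below illustrates the difference: the same `χ` is one byte in ISO-8859-7 and two bytes in UTF-8, and handing raw bytes to the XML parser lets it follow the encoding named in the XML declaration. The sample document and its UTF-8 declaration are assumptions made for illustration, not taken from the poster's file.

```python
import xml.etree.ElementTree as ET

chi = "χ"

# The same character is stored as different bytes under each encoding.
utf8_bytes = chi.encode("utf-8")        # two bytes
greek_bytes = chi.encode("iso-8859-7")  # one byte

# Hypothetical document; the declaration tells the parser how to decode it.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<AbstractText>χ test</AbstractText>"
).encode("utf-8")

# Passing bytes (not a decoded string) lets the parser apply the declared encoding.
parsed = ET.fromstring(raw).text

# Decoding UTF-8 bytes with the wrong codec does not give the character back.
wrong = utf8_bytes.decode("iso-8859-7", errors="replace")

print(utf8_bytes, greek_bytes)
print(parsed)
print(wrong == chi)
```

If the file declares its own encoding, opening it in binary mode and passing it to the parser avoids having to guess a codec by hand.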

written 2.7 years ago by Bastien Hervé
2.7 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

How to parse a pubmed abstract that

You can use an XSLT stylesheet to extract the fields you need, e.g.: Pubmed Xml To Other Format, Such As Enw ; What is the simplest way to go from Pubmed ID to citation, programmatically? ; Importing Pubmed Medline Details Into A Local Rdbms To Execute Data Mining Methods ; parse nxml file from pubmed ; Getting Tab-Delimited Pmids And Abstracts From Pubmed ; https://www.biostars.org/p/14051/; ...

modified 2.7 years ago • written 2.7 years ago by Pierre Lindenbaum