Question: parse nxml file from pubmed
Quak (United States) wrote 17 months ago:

I have downloaded PubMed articles from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/. The files come in nxml format, and I would like to go through each paper and do some NLP.

I already tried two packages,

1)

library(xml2)
tt = read_xml("BMC_Cell_Biol/PMC1079802.nxml")

2)

library(XML)  # provides xmlParse() and xmlToList()
paper1 <- xmlParse("BMC_Cell_Biol/PMC1079802.nxml")
xml_data <- xmlToList(paper1)

but neither of them really parses the whole file very well. For example, you can't get into the Introduction section! I was wondering if someone could share some scripts (preferably in R) for this; otherwise, I am planning to do it in Python.


see Pmc Xml Parsing

written 17 months ago by Pierre Lindenbaum
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 17 months ago:

Using xslt:

example:

$ curl -s "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/comm_use.A-B.xml.tar.gz" | tar xvz  3_Biotech/PMC4624140.nxml  --to-command 'xsltproc --novalid transform.xsl 3_Biotech/PMC4624140.nxml'
3_Biotech/PMC4624140.nxml
Microorganisms are one of the tools used to detoxify toxic compounds present in the environment. Free suspended or immobilized microbial cells can be used for this purpose. However, the immobilized microbial cells have many advantages over free suspended cells under different conditions. For instance, the immobilization of whole cells increases degradation rate owing to increased cell population density, cell wall permeability, and extracellular microbial enzymes stability are improved, cells can be easily removed from the reaction mixture, higher operational stability and storage stability, reuse of immobilized cells in continuous reactors, and allows the bioreactors to operate at flow rates different from the growth rate of the microorganisms (Bettmann and Rehm 1984; Hall and Rao 1989; Cassidy et al. 1996; Ha et al. 2009; Zheng et al. 2009). In addition, the immobilized cell systems act as a protective cover in the presence of toxic compounds and are more resistant to pH or temperature changes. However, free suspended cells have better mass transfer aspects compared to immobilized bacterial or fungal cells (Trevors et al. 1992; Zheng et al. 2009).In the last two decades, there have been intensive researches on the use of immobilized microbial cells as biocatalysts, using numerous reactors like fed batch, semi-continuous fed batch, and continuous packed bed reactor. Each reactor type possesses its disadvantages and advantages, and the choice of a particular type of a reactor may depend on the operational conditions, and inexpensive and non-toxic support inert material for microbial cell immobilization, etc., (Zheng et al. 2009). Bacterial cells immobilized on various matrices have been used extensively for biodegradation of various toxic nitroaromatics such as trinitrotoluene (TNT) (Rho et al. 2001; Ullah et al. 2010), nitrobenzene (Zheng et al. 2009; Qi et al. 2012), 2-nitrotoluene (Mulla et al. 2013), and 3-nitrobenzoate (Mulla et al. 
2012).Pendimethalin [N-(1-ethyl propyl) 2,6-dinitro-3,4-xylidine], a common water and soil contaminant, herbicide of dinitroaniline group, is used to control weeds in various crop plants. The use of pendimethalin may adversely affect endangered species of terrestrial and semi-aquatic plants and invertebrates (Kole et al. 1994). One of the best strategies to degrade the hazardous compounds (including pendimethalin) is to use microorganisms. There are few reports on the degradation of pendimethalin by free cells of Fusarium oxysporum and Paecilomyces variotii (Singh and Kulshrestha 1991), Azotobacter chroococcum (Kole et al. 1994), Bacillus circulans (Megadi et al. 2010), and fungus Lecanicillium saksenae (Pinto et al. 2012). However, there is no report on the degradation of pendimethalin by immobilized bacterial or fungal cells. The aim of the present investigation was therefore to compare the pendimethalin degradation by freely suspended and immobilized cells of Bacillus lehensis XJU on various matrices in batch and semi-continuous degradation, and to evaluate the effect of pH, temperature, and storage stability of pendimethalin degradation rate by polyurethane foam (PUF)-immobilized bacterial cells.

Thanks. If I understand correctly: I made the transform.xsl file and then used it as

cat PMC4729119.nxml | xsltproc --novalid transform.xsl

but then I got,

transform.xsl:1: namespace error : xmlns:xsl: '<a href=' is not a valid URI
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                       ^
transform.xsl:1: parser error : error parsing attribute name
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
transform.xsl:1: parser error : attributes construct error
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
transform.xsl:1: parser error : Couldn't find end of Start Tag stylesheet line 1
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
transform.xsl:1: parser error : Extra content at the end of the document
<xsl:stylesheet xmlns:xsl="&lt;a href=" <a="" 
                                        ^
written 17 months ago by Quak

Ah, Biostars messed up my code. I'll replace it with a gist...

written 17 months ago by Pierre Lindenbaum

Thanks. Just to update you: that now works, but it doesn't catch anything ... maybe I should play with <xsl:apply-templates select="//sec[title/text() = 'Introduction']"/>. I wish there was a way to parse it into a structured format and then play with the different segments.

written 17 months ago by Quak
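
On the wish for a structured format: since Python was mentioned as a fallback, here is a minimal sketch using only the standard library. It assumes the JATS-style layout discussed in this thread (top-level `<sec>` elements under `<body>`, each with a `<title>`); the sample document and the helper names `section_text` and `sections_by_title` are made up for illustration. Real .nxml files may contain entities or structures this does not handle.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a PMC .nxml article (hypothetical content, JATS-style tags).
SAMPLE = """<article>
  <front><article-meta><abstract><p>Short abstract.</p></abstract></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First <italic>intro</italic> paragraph.</p><p>Second paragraph.</p></sec>
    <sec sec-type="materials|methods"><title>Methods</title>
      <sec><title>Strains</title><p>Nested subsection text.</p></sec>
    </sec>
  </body>
</article>"""


def section_text(sec):
    """Join the text of every <p> under a section, nested subsections included."""
    paragraphs = ("".join(p.itertext()).strip() for p in sec.iter("p"))
    return "\n".join(paragraphs)


def sections_by_title(root):
    """Map each top-level <body>/<sec> title to its paragraph text."""
    result = {}
    for sec in root.findall("./body/sec"):
        title = sec.find("title")
        key = "".join(title.itertext()).strip() if title is not None else ""
        result[key] = section_text(sec)
    return result


# For a file on disk, use ET.parse(path).getroot() instead of ET.fromstring().
root = ET.fromstring(SAMPLE)
sections = sections_by_title(root)
print(sections["Introduction"])
```

The resulting dict can then be fed section by section into whatever NLP step comes next; sections without a title end up under the empty-string key.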

It worked with PMC4624140; I was just searching for

<sec>
<title>Introduction</title>

Maybe PMC4729119 has a different structure... the section could be titled Abstract or Background.

modified 17 months ago • written 17 months ago by Pierre Lindenbaum

Thanks so much, I can see that. Could you also point me to where/how I can learn to tweak the pattern matching? For example, I would like to capture the Methods section, which in the original XML looks like <sec sec-type="materials|methods"><title>Methods</title>

written 17 months ago by Quak

search for an xpath tutorial.

could be something like:

//sec[@sec-type = 'materials|methods' or title/text() = 'Methods']
written 17 months ago by Pierre Lindenbaum
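
One thing to keep in mind when tweaking the pattern: XPath string comparisons are case-sensitive, so a title test has to match the capitalisation used in the file, and titles vary between papers (Introduction vs. Background, as noted above). Below is a rough Python sketch of a more forgiving matcher that accepts either a `sec-type` attribute value or a lower-cased title. The function name `find_sections` and the sample XML are invented for illustration. The filtering is done in plain Python because the XPath subset in `xml.etree.ElementTree` does not support `or`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature article body with differently styled section titles.
SAMPLE = """<article><body>
  <sec><title>Background</title><p>Why this matters.</p></sec>
  <sec sec-type="materials|methods"><title>Materials and Methods</title><p>What was done.</p></sec>
  <sec><title>RESULTS</title><p>What was found.</p></sec>
</body></article>"""


def find_sections(root, sec_types=(), titles=()):
    """Return <sec> elements whose sec-type or lower-cased title is in the given collections."""
    wanted_titles = {t.lower() for t in titles}
    hits = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        title_text = "".join(title.itertext()).strip().lower() if title is not None else ""
        if sec.get("sec-type") in sec_types or title_text in wanted_titles:
            hits.append(sec)
    return hits


root = ET.fromstring(SAMPLE)
# Pass tuples (not bare strings) so the membership tests behave as intended.
methods = find_sections(root, sec_types=("materials|methods", "methods"))
intro = find_sections(root, titles=("Introduction", "Background"))
print([s.findtext("title") for s in methods + intro])
```

Matching on `sec-type` tends to be more robust than matching on the title when the attribute is present, but not every article sets it, which is why the sketch accepts both.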
shoujun.gu (Rockville/MD) wrote 17 months ago:

have you tried xtract?

https://dataguide.nlm.nih.gov/edirect/xtract.html
