Avoiding empty columns when using EDirect's xtract

10.4 years ago

glwinsor • 0

Hi,

I'm evaluating NCBI's EDirect command-line tool for the first time (as an alternative to using E-utilities and parsing the XML). It looks pretty good so far, but I'm having trouble with cases where I try to extract individual fields from a docsum and one of the fields is empty. For example, when I run a very simple query of the bioproject database for all "Pseudomonas aeruginosa PAO1" records:

esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" | efetch -format docsum  | xtract -pattern DocumentSummary -element  Organism_Strain Project_Acc

some of the returned records have no Organism_Strain value, so the Project_Acc value gets shifted into the first column:

MPAO1    PRJNA273663
PRJEB8227
PRJNA268347
PAO1    PRJNA266474
PRJNA265367
PA01    PRJNA264943
PAO1    PRJNA258237
PRJNA256959
PRJNA252740
PRJNA252560

I would like xtract to return an empty string, or even "-", when an Organism_Strain value is missing, so that every line has two columns instead of a mix of one- and two-column lines:

PA01    PRJNA264943
PAO1    PRJNA258237
-    PRJNA256959
-    PRJNA252740
-    PRJNA252560

I've looked over the official NCBI documentation and tried assigning each field to an initialized variable, using different combinations of the following:

esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" | efetch -format docsum | xtract -pattern DocumentSummary -element "&STRAIN" Organism_Strain  -STRAIN "(-)" -element Bioproject_acc

but I can't seem to find the correct syntax to make things work the way I want. Does anyone have any experience with this kind of problem?

Thanks

NCBI EDirect • 2.3k views

updated 3.2 years ago by Ram 45k • written 10.4 years ago by glwinsor • 0

10.4 years ago

Pierre Lindenbaum 166k

Use the following XSLT stylesheet (saved here as stylesheet.xsl) with xsltproc:

	<?xml version="1.0" encoding="UTF-8"?>
	<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
	  <xsl:output method="text"/>
	  <!-- Process every DocumentSummary in the efetch result -->
	  <xsl:template match="/">
	    <xsl:apply-templates select="RecordSet/DocumentSummary"/>
	  </xsl:template>
	  <!-- One line per record: strain (or "-" if absent), a tab, then the accession -->
	  <xsl:template match="DocumentSummary">
	    <xsl:choose>
	      <xsl:when test="Project/ProjectType/ProjectTypeSubmission/Target/Organism/Strain">
	        <xsl:value-of select="Project/ProjectType/ProjectTypeSubmission/Target/Organism/Strain/text()"/>
	      </xsl:when>
	      <xsl:otherwise>
	        <xsl:text>-</xsl:text>
	      </xsl:otherwise>
	    </xsl:choose>
	    <!-- &#9; is a tab, &#10; is a newline; character references survive re-indentation -->
	    <xsl:text>&#9;</xsl:text>
	    <xsl:value-of select="Project/ProjectID/ArchiveID/@accession"/>
	    <xsl:text>&#10;</xsl:text>
	  </xsl:template>
	</xsl:stylesheet>


curl -s 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=bioproject&id=273663,268347,266474,265367,264943,258237,256959,253371,252560,246577,246187,243443,243442,243441,243440,243439,243438,238552,234392,231042,229797,229796,228966,227545,225027,225026,222984,213568,213564,212862,212861,210078,209744,209743,206372,204979,203313,202063,201067,201024,197381,197270,186945,178535,169508,163909,153781,153301,152899,152815,152481,151585,150221,149213,149177,149097,148359,146229,144393,143701,141069,140783,140437,139731,139627,139393,134617,128339,127841,126779,125979,119669,117261,111043,109535,108523,108169,106267,103929,102875,102387,101467,100923,100723,99427,98977,98025,97617,96113,95321,95131,94055,93093,92577,91591,83447,66135,57945,331' | xsltproc stylesheet.xsl -

MPAO1    PRJNA273663
-    PRJNA268347
PAO1    PRJNA266474
-    PRJNA265367
PA01    PRJNA264943
PAO1    PRJNA258237
-    PRJNA256959
-    PRJNA253371
-    PRJNA252560
PA01    PRJNA246577
-    PRJNA246187
PAO1    PRJNA243443
PAO1    PRJNA243442
PAO1    PRJNA243441
PAO1    PRJNA243440
PAO1    PRJNA243439
PAO1    PRJNA243438
-    PRJNA238552
-    PRJNA234392
PAO1    PRJNA231042
PA01    PRJNA229797
PA01    PRJNA229796
-    PRJNA228966
PAO1-GFP    PRJNA227545
PAO1-VE13    PRJNA225027
PAO1-VE2    PRJNA225026
-    PRJNA222984
PAO1    PRJNA213568
PA01    PRJNA213564
PAO1-VE13    PRJNA212862
PAO1-VE2    PRJNA212861
-    PRJNA210078
PAO1    PRJNA209744
PAO1-CipR    PRJNA209743
PAK    PRJNA206372
-    PRJNA204979
-    PRJNA203313
PAO1-CipR    PRJNA202063
-    PRJNA201067
PAO1    PRJNA201024
PAO1    PRJNA197381
-    PRJNA197270
-    PRJNA186945
-    PRJNA178535
PAO1    PRJNA169508
-    PRJNA163909
-    PRJNA153781
-    PRJNA153301
-    PRJNA152899
-    PRJNA152815
-    PRJNA152481
-    PRJNA151585
-    PRJNA150221
-    PRJNA149213
-    PRJNA149177
-    PRJNA149097
-    PRJNA148359
-    PRJNA146229
-    PRJNA144393
-    PRJNA143701
-    PRJNA141069
-    PRJNA140783
-    PRJNA140437
-    PRJNA139731
-    PRJNA139627
-    PRJNA139393
-    PRJNA134617
-    PRJNA128339
-    PRJNA127841
-    PRJNA126779
-    PRJNA125979
-    PRJNA119669
-    PRJNA117261
-    PRJNA111043
-    PRJNA109535
-    PRJNA108523
-    PRJNA108169
-    PRJNA106267
-    PRJNA103929
-    PRJNA102875
-    PRJNA102387
-    PRJNA101467
-    PRJNA100923
-    PRJNA100723
-    PRJNA99427
-    PRJNA98977
-    PRJNA98025
-    PRJNA97617
-    PRJNA96113
-    PRJNA95321
-    PRJNA95131
-    PRJNA94055
-    PRJNA93093
-    PRJNA92577
-    PRJNA91591
PAO1H2O    PRJNA83447
PAK    PRJNA66135
PAO1    PRJNA57945
PAO1    PRJNA331
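
If installing xsltproc is not an option, the two-column xtract output from the question could also be patched after the fact. The following is only a rough sketch, not part of EDirect: it assumes the fields are tab-separated and that the strain is the only field that can be missing, so any one-column line must be a bare accession. The function name `pad_missing_strain` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: pad one-column lines with a "-" placeholder.
# Assumptions: input is tab-separated, and only the FIRST field (strain)
# can ever be absent, so a line with a single field is a bare accession.
pad_missing_strain() {
    awk -F '\t' -v OFS='\t' 'NF == 1 { print "-", $1; next } { print }'
}

# Sample input shaped like the xtract output in the question.
printf 'MPAO1\tPRJNA273663\nPRJEB8227\nPAO1\tPRJNA266474\n' | pad_missing_strain
```

In practice the function would be appended to the original pipeline (`... | xtract ... | pad_missing_strain`). This trick breaks down as soon as more than one field can be empty, because a short line no longer tells you which column is missing; the XSLT approach above does not have that limitation. Newer EDirect releases also document a `-def` argument to xtract for supplying a placeholder for missing fields, so it is worth checking `xtract -help` on your install.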

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Pierre Lindenbaum 166k

Thanks for this. I oversimplified my example and requirements for the sake of clarity, but it looks like the XSLT approach will handle them very well!

updated 3.2 years ago by Ram 45k • written 10.4 years ago by glwinsor • 0