Hi,
I'm evaluating NCBI's EDirect command line tool for the first time (as an alternative to using E-utilities and parsing the XML). It looks pretty good so far, but I'm having trouble with cases where I try to extract individual fields from a docsum and one of the fields is empty. For example, when I do a very simple query of the bioproject database for all "Pseudomonas aeruginosa PAO1" records:
esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism_Strain Project_Acc
some of the records returned do not have an Organism_Strain value, and the Project_Acc value gets shifted to the first column:
MPAO1 PRJNA273663
PRJEB8227
PRJNA268347
PAO1 PRJNA266474
PRJNA265367
PA01 PRJNA264943
PAO1 PRJNA258237
PRJNA256959
PRJNA252740
PRJNA252560
I would like to have xtract return an empty string or even "-" for cases where an Organism_Strain is missing, so that every line has two columns instead of a mix of one-column and two-column lines:
PA01 PRJNA264943
PAO1 PRJNA258237
- PRJNA256959
- PRJNA252740
- PRJNA252560
I've looked over the official NCBI documentation and tried to assign each field to an initialized variable, using different combinations of the following:
esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" | efetch -format docsum | xtract -pattern DocumentSummary -element "&STRAIN" Organism_Strain -STRAIN "(-)" -element Bioproject_acc
but I can't seem to find the correct syntax to make things work the way I want. Does anyone have any experience with this kind of problem?
Thanks
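For the two-column case described above, one possible workaround is to post-process the xtract output rather than change the extraction itself. The sketch below is only that: `pad_missing_strain` is a hypothetical helper name, not part of EDirect, and it assumes the output is tab-separated, that only the first column (the strain) can be missing, and that the accession is always present. xtract also lists a `-def` option for a default placeholder, so something like `-def "-"` before `-element` may do this directly; that is worth checking against `xtract -help` for the installed version.

```shell
# Hypothetical helper (not part of EDirect): pad tab-separated xtract output
# so that every line has two columns.
# Assumption: only the first column (strain) can be missing, and the
# accession is always present.
pad_missing_strain() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    NF == 1 { print "-", $1; next }   # strain missing: insert placeholder
    { print }                         # both fields present: pass through
  '
}

# Demo on three lines taken from the output shown in the question.
printf 'PAO1\tPRJNA258237\nPRJNA256959\nPRJNA252740\n' | pad_missing_strain

# With the real pipeline (needs EDirect and network access) it would be used as:
# esearch -db bioproject -query "Pseudomonas aeruginosa PAO1" |
#   efetch -format docsum |
#   xtract -pattern DocumentSummary -element Organism_Strain Project_Acc |
#   pad_missing_strain
```

This only works because there are exactly two fields and the second is never empty; with three or more optional fields a shifted column can no longer be identified after the fact, and the placeholder has to be added at extraction time.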
Thanks for this. I oversimplified my example and requirements for the sake of clarity, but it looks like the XSLT approach will handle them very well!