Andra's idea to use xmlstarlet turned out to be extremely helpful. I must admit that I had problems understanding the documentation at http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html
but Andra's example on parsing BLAST output and some trial & error did the job. Mostly, that is.
I would like to summarize what I learned about the use of xmlstarlet, because it might be useful for other dummies like me who are facing similar problems. For that purpose, let us assume a simple XML file, simple.xml (inspired by the structure of LOCATE, but greatly simplified):
<TEST_doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ENTRY uid="123456">
<protein>
<name>PROT001</name>
<organism>Human</organism>
<class>cytoplasmic</class>
</protein>
<xrefs>
<xref>
<database>Ensembl</database>
<accn>ENSG00000105829</accn>
</xref>
</xrefs>
</ENTRY>
<ENTRY uid="45678">
<protein>
<name>PROT002</name>
<organism>Human</organism>
<class>nuclear</class>
</protein>
<xrefs>
<xref>
<database>Ensembl</database>
<accn>ENSG00000105333</accn>
</xref>
</xrefs>
</ENTRY>
</TEST_doc>
This XML file describes two database entries (each of them bracketed by <ENTRY> </ENTRY>). Each entry has a name, an associated organism and a 'class' telling us something about the localization of the protein. Each entry also contains a section called <xrefs>, which might contain cross-references to other databases. In this simple example, there is only one such xref per entry.
To follow Andra's suggestion, I needed the xmlstarlet command-line program, which I could install under Ubuntu with apt-get install xmlstarlet.
To get a first overview of the XML structure, I tried
cat simple.xml | xmlstarlet el
which gave the following output (XML structure without data):
TEST_doc
TEST_doc/ENTRY
TEST_doc/ENTRY/protein
TEST_doc/ENTRY/protein/name
TEST_doc/ENTRY/protein/organism
TEST_doc/ENTRY/protein/class
TEST_doc/ENTRY/xrefs
TEST_doc/ENTRY/xrefs/xref
TEST_doc/ENTRY/xrefs/xref/database
TEST_doc/ENTRY/xrefs/xref/accn
TEST_doc/ENTRY
TEST_doc/ENTRY/protein
TEST_doc/ENTRY/protein/name
TEST_doc/ENTRY/protein/organism
TEST_doc/ENTRY/protein/class
TEST_doc/ENTRY/xrefs
TEST_doc/ENTRY/xrefs/xref
TEST_doc/ENTRY/xrefs/xref/database
TEST_doc/ENTRY/xrefs/xref/accn
The next step was to extract the data fields that are of interest to me. With xmlstarlet, this can apparently be done by
cat simple.xml | xmlstarlet sel -t -m //ENTRY -v "concat(field1,' ',field2)" -n
where 'sel' selects data via XPath expressions and '-t' starts the output template,
'-m //ENTRY' means that the subsequent part is applied to each ENTRY (the leading '//' is XPath shorthand for 'at any depth', so every ENTRY element in the document is matched), and the last part indicates the fields that are to be extracted (-n generates a linebreak after each entry). The field specification is an XPath expression that is evaluated relative to the node matched by the -m option. For example, ./protein/name means TEST_doc/ENTRY/protein/name (relative to ENTRY).
The command
cat simple.xml | xmlstarlet sel -t -m //ENTRY -v "concat(./protein/name,' ',./protein/class,' ',./xrefs/xref/database,' ',./xrefs/xref/accn)" -n
results in the output
PROT001 cytoplasmic Ensembl ENSG00000105829
PROT002 nuclear Ensembl ENSG00000105333
which is exactly what I need.
BUT, there remains a little problem. If the XML is a little more complex than the above example, e.g. because each entry has two xref sections pointing to different databases:
<TEST_doc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ENTRY uid="123456">
<protein>
<name>PROT001</name>
<organism>Human</organism>
<class>cytoplasmic</class>
</protein>
<xrefs>
<xref>
<database>Ensembl</database>
<accn>ENSG00000105829</accn>
</xref>
<xref>
<database>UNIPROT</database>
<accn>Q12345</accn>
</xref>
</xrefs>
</ENTRY>
<ENTRY uid="45678">
<protein>
<name>PROT002</name>
<organism>Human</organism>
<class>nuclear</class>
</protein>
<xrefs>
<xref>
<database>Ensembl</database>
<accn>ENSG00000105333</accn>
</xref>
<xref>
<database>UNIPROT</database>
<accn>Q14789</accn>
</xref>
</xrefs>
</ENTRY>
</TEST_doc>
In this case, the same xmlstarlet call will retrieve only the first xref it encounters:
cat complex.xml | xmlstarlet sel -t -m //ENTRY -v "concat(./protein/name,' ',./protein/class,' ',./xrefs/xref/database,' ',./xrefs/xref/accn)" -n
results in the output
PROT001 cytoplasmic Ensembl ENSG00000105829
PROT002 nuclear Ensembl ENSG00000105333
and the xref to the uniprot database is not listed in the output.
After doing some experiments, I found that this problem can be circumvented by matching on 'xref' rather than the more intuitive 'ENTRY'. When specifying the reported fields relative to the match, it is possible to 'go backwards' by using the unix-style '../..' paths.
The command
cat complex.xml | xmlstarlet sel -t -m //xref -v "concat(../../protein/name,' ',../../protein/class,' ',./database,' ',./accn)" -n
gave me the desired output
PROT001 cytoplasmic Ensembl ENSG00000105829
PROT001 cytoplasmic UNIPROT Q12345
PROT002 nuclear Ensembl ENSG00000105333
PROT002 nuclear UNIPROT Q14789
In the real LOCATE database, things are even more complicated, as each entry can have multiple xrefs and also multiple predictions. I guess that in those cases, I will have to do two or more xmlstarlet extractions and combine the resulting tables later by some sed/awk/join magic.
Anyway, I consider my problem solved. Many thanks to all who have answered!
@Lycopersicon. If you have to write a table, tab-delimited is fine. But if you have to write the output of a blast, tab-delimited is terrible! XML, on the contrary, can do it. However, you could write the XML file wrong (forgetting a [?] would make the parser go mad), so you can "validate" the XML to make sure it conforms to what it should look like (in case you make your own blast output, for instance). If it is "valid", everybody knows how to handle it. With tab-delimited, if you forget a tab or put two in a row....
You mean "into another decent parseable format"? Because, provocation aside, an XML document format (with a specific XML Schema) is a decent parseable (even automatically parseable) format. What you didn't know is that your tab-delimited output 'format' is actually not a format! That said, this is the main advantage of XML: it allows one to specify flexible, complex document formats against which a document can be automatically validated, and documents can be parsed and transformed. As for the solution, Pierre's answer also demonstrates the advantages of XML.
Hey, this line with the 'decent format' was supposed to be a joke! But at least I got your attention. Pierre's approach looks interesting and I will see if it gets me anywhere. And sorry if I called 'tab-delimited files' a format. Please remember, I am a biologist, so you have to talk SLOWLY!
I also fail to appreciate the importance of 'validation' that the XML folks talk so much about. To me, XML looks so obfuscated that it NEEDS validation, while with a simple format you can see right away if it's OK. But this is probably just because I don't understand what validation is. Sigh.
@Stefano, for the last 10 years, I have been living happily with blast -m8 (you know what I mean)