Question

mzIdentML's Non-Mapping OBO


13.7 years ago

Jesse J ▴ 150

At my professor's suggestion, I created a pepXML to mzIdentML converter in Ruby. But as I tried mapping pepXML terms to mzIdentML terms, I came across some challenges. I've come up with solutions to those challenges, but I don't know why they exist in the first place.

Why is it that pepXML doesn't have a one-to-one mapping with the mzIdentML terms listed in the OBO? For example, one of Mascot's scores is called ionscore in pepXML, but called Mascot:score in mzIdentML.

Why is it that some scores are not listed at all in the OBO? Like X! Tandem's yscore or bscore?

Why is the OBO used in the first place? Having to update it whenever a new term is needed can be a pain.



It helps because people can then look at the file formats and may recognize similarities with other problems that they've solved in the past.

13.7 years ago by Istvan Albert 100k


I don't know those two formats. Can you edit your question and add some samples of those two file formats?

13.7 years ago by Pierre Lindenbaum 161k


@Pierre I guess so, but I don't see how that would help.

13.7 years ago by Jesse J ▴ 150

Ram · Answer 1 · 2010-07-30

Sorry, the current state of proteomics data formats is quite bad, as Paulo Nuin already commented. Having one common standard format per task would be a great step forward.

Regarding the OBO file: such a controlled vocabulary is really necessary. For XML files you define the structure in a schema (XSD file), and you do not want that structure to change every time a new score is added. Instead, the type of a value (for example, the type of score of a peptide) is specified through attributes of a generic node that refer to a controlled vocabulary term. Without a controlled vocabulary, one person would name it "Mascot:score", another "mascot-score", and a third just "score".

To solve your problem with scores which are not yet in the OBO:

[Term]
id: JJ:000001
name: xtandem:yscore
def: "The X!Tandem result 'yscore'." [PSI:PI]
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001116 ! single protein result details
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

You can either write your own OBO file, e.g. with an entry like the one above (and parse it), or request new CV terms for the PSI-MS Controlled Vocabulary from its maintainers.
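For the write-your-own-OBO route, a minimal Ruby sketch of reading [Term] stanzas might look like the following. It assumes simple one-line "tag: value" pairs like the entry above and ignores escapes, trailing modifiers, and other stanza types. The names `parse_obo_terms` and `SAMPLE_OBO` are made up for illustration and do not come from any library.

```ruby
# Minimal OBO [Term] stanza reader (a sketch, not a full OBO parser).
# Assumption: every tag sits on one line in "tag: value" form.
def parse_obo_terms(text)
  terms = []
  current = nil
  text.each_line do |raw|
    line = raw.strip
    if line.start_with?('[')
      # A new stanza starts; only [Term] stanzas are collected.
      current = line == '[Term]' ? { 'is_a' => [] } : nil
      terms << current if current
    elsif current && !line.empty? && !line.start_with?('!')
      tag, value = line.split(': ', 2)
      next if value.nil?
      if tag == 'is_a'
        # Keep only the accession, drop the "! readable name" comment.
        current['is_a'] << value.split(' ! ', 2).first.strip
      else
        current[tag] = value
      end
    end
  end
  terms
end

# The home-made term from this answer (JJ:000001 is not an official accession).
SAMPLE_OBO = <<~'OBO'
  [Term]
  id: JJ:000001
  name: xtandem:yscore
  def: "The X!Tandem result 'yscore'." [PSI:PI]
  xref: value-type:xsd\:double "The allowed value-type for this CV term."
  is_a: MS:1001116 ! single protein result details
  is_a: MS:1001143 ! search engine specific score for peptides
  is_a: MS:1001153 ! search engine specific score
OBO

yscore = parse_obo_terms(SAMPLE_OBO).find { |t| t['name'] == 'xtandem:yscore' }
puts "#{yscore['id']} is_a #{yscore['is_a'].join(', ')}"
```

The same function could be pointed at the official PSI-MS OBO file and at your own file, so that both sets of terms end up in one lookup table.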

This comment in Deutsch et al., Proteomics 2010 might help to clarify the relationship between pepXML and mzIdentML (though it does not improve the current situation):

The PSI proteomics informatic group, also with cooperation from the TPP development team, has developed a new format that can encode downstream informatic analysis of proteomics MS data. The new format, mzIdentML, combines the information encoded in pepXML and protXML and much more into a single file format (except for quantification information, which is expected from a subsequently released mzQuantML format). Although it is likely that the TPP will continue to use pepXML and protXML as internal working formats for the pipeline, in the future the TPP will convert all the final results to mzIdentML once its development is complete.

For now, anyway, you need (your own?) mapping of pepXML terms to CV terms.
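Such a mapping could be sketched in Ruby roughly as below. This is only an illustration under assumptions: the table `PEPXML_TO_CV` and the method `map_score` are hypothetical names, the accession MS:1001171 for Mascot:score is what I believe the PSI-MS CV uses and should be checked against the OBO file, and JJ:000001 is the home-made term from above. Scores without a mapping fall back to a free-text userParam, which mzIdentML allows alongside cvParam.

```ruby
# Hypothetical lookup table: pepXML search_score name => CV term.
# 'ionscore' => 'Mascot:score' is the example from the question.
# ASSUMPTION: MS:1001171 is the accession of Mascot:score; verify in the OBO.
# JJ:000001 is a self-defined term, not an official PSI-MS accession.
PEPXML_TO_CV = {
  'ionscore' => { cv_ref: 'PSI-MS', accession: 'MS:1001171', name: 'Mascot:score' },
  'yscore'   => { cv_ref: 'JJ',     accession: 'JJ:000001',  name: 'xtandem:yscore' }
}.freeze

# Returns a hash describing the element to write for one pepXML score.
def map_score(pepxml_name, value)
  term = PEPXML_TO_CV[pepxml_name]
  if term
    # Known score: write it as a cvParam that points at the CV term.
    { element: 'cvParam' }.merge(term).merge(value: value)
  else
    # Unknown score: keep the original pepXML name in a userParam.
    { element: 'userParam', name: pepxml_name, value: value }
  end
end

p map_score('ionscore', '42.5')
p map_score('bscore', '1.0')
```

Keeping the table in one place means a newly approved CV term only requires a new entry, not a change to the converter logic.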

Hope that helps.

Ram · Answer 2 · 2010-07-27


13.7 years ago

Paulo Nuin ★ 3.7k

Welcome to the world of proteomics data formats. I can feel your pain, but I don't know if I'm able to help you overcome your problems.

The main factor here is that the two formats were created by different groups: pepXML by the TPP people and mzIdentML by the HUPO Proteomics Standards Initiative (PSI). Mascot's documentation gives some idea about pepXML:

The pepXML format is only applicable to MS/MS search results, and represents "raw" peptide match data. Information is exported for all matches to all queries, (MS/MS spectra). For each match, extensive information is provided for the first protein in which the peptide is found and more limited information for all the other proteins. This can make the output file very large.

As you can see, pepXML can only handle MS/MS data, while mzIdentML is (and will be) able to handle other types of data. mzIdentML is still under development.

Also, in any area of science, different groups developing things in parallel (or not) will come up with different answers to the same problem. Proteomics is no different, and this is just the tip of the iceberg of proteomics data formats. You ask why the OBO is used: one group created the format and then created the software to handle it; maybe nobody wrote the program to translate it.



So what you're saying is, because it's a different group, they decided to do things their own way, even if it means making the transition from pepXML to mzIdentML difficult?

13.7 years ago by Jesse J ▴ 150


Yes and no.

There is a myriad of formats in proteomics (pepXML, Mascot's output, ProteinPilot output, mzXML, protXML, etc.), and the groups behind them are interested in different aspects of the data, so each format covers one area but not another.

See here for an example of some of the formats. Several times I had problems converting from one format to another, or simply lost data in the conversion.

updated 5.6 years ago by Ram 43k • written 13.7 years ago by Paulo Nuin ★ 3.7k


Sorry that my answer was a bit of a letdown, but this is the truth about the field.

13.7 years ago by Paulo Nuin ★ 3.7k

Ram · Answer 3 · 2010-11-11


13.4 years ago

Brianbalgley ▴ 110

Haven't tried it, but you might want to take a look:


Ram · Answer 4 · 2011-04-16

PepXML was developed years before mzIdentML was a glint in anyone's eye. And then once it was a glint (i.e. going by the old name analysisXML), it took ~5 years for it to get to 1.0. Developing a standard format for proteomic search results was hard! An update to the format is in development now (1.1) and your input is welcome at:

Email

I wrote bidirectional serialization for pepXML and mzIdentML for ProteoWizard and ran into the same issues you have but was able to work around them for the most part. Of course the serialization is lossy for some of the metadata, but the main stuff gets through. Pwiz has robust support for the OBO CVs though, which takes considerable work by itself (but as Florian said, the benefits are worth it). Just look at mzData/mzXML and how many ways there are to write "ion trap" or "LTQ" in it.

For the things I can't work around, I contact their working group and pester them to fix it. ;)