Question

Mismatch between gene IDs in Map-SeqNum-ID.txt and HierarchicalGroups.orthoxml

5.0 years ago

eschang1 ▴ 10

Hi there,

Thanks so much for all of your help so far. I recently completed a successful OMA standalone run and am now digging into the results using the PyHam Python package. I have noticed what looks like a mismatch between the gene IDs that OMA lists in the Map-SeqNum-ID.txt file and the gene IDs stored in the HOG OrthoXML file.

For example, I have been trying to access information about a particular sequence listed as below in the mapping file:

branchiostoma_floridae    8460    XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]

I have loaded my species tree and OrthoXML file into PyHam and have tried querying by gene ID, like so:

gene_8460 = metazoa_ham.get_gene_by_id(8460)
print(gene_8460.get_dict_xref())

This returned: {'id': '8460', 'protId': 'oki.206.4.t1'}, which is clearly not the gene I was trying to query.

I tried searching in reverse to confirm, i.e. using the external gene name:

test_gene = metazoa_ham.get_genes_by_external_id('XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]')[0]
print(test_gene.get_dict_xref())

Which returned: {'id': '137320', 'protId': 'XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]'}

So it appears that the external ID matches what is expected, but this sequence is stored as 8460 in the mapping file and as 137320 in the OrthoXML. I tested several other sequences in this way and had similar results.

My main question is: Is this ID mismatch a symptom of something having gone wrong during the run, or is it expected behavior? As long as I have some way of accurately querying sequences of interest to get their root-level HOGs etc., should I not be worried about this?

Additionally, is there some way to get PyHam to write out a list of OrthoXML gene IDs and their associated external IDs so I don't always need to use the cumbersome external sequence names?

Thank you once again!

Sally Chang

oma orthology pyham • 1.2k views

updated 5.0 years ago by Adrian Altenhoff ★ 1.1k • written 5.0 years ago by eschang1 ▴ 10

score 1 · Answer 1 · 2019-05-02

Dear Sally,

The difference is expected behaviour. Note that in the Map-SeqNum-ID.txt file, the numeric column (8460 in your example) starts from 1 for every genome, so 8460 on its own is not a unique ID; it is only unique in combination with the genome name. The id attribute in the <gene> tag of the OrthoXML file, however, must be unique across all genomes.
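To make the per-genome numbering concrete, here is a minimal sketch that reads a mapping file and keys each entry by (genome, number) rather than by the number alone. It assumes the file is tab-separated with the columns genome, sequence number, external ID, and that comment lines start with "#". The function name read_seqnum_map, the header line, and the second genome in the sample are made up for illustration.

```python
import io

def read_seqnum_map(handle):
    """Return {(genome, seq_num): external_id} from a Map-SeqNum-ID.txt-like file.

    Assumes tab-separated columns: genome, per-genome sequence number, external ID.
    """
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        genome, seq_num, external_id = line.split("\t", 2)
        mapping[(genome, int(seq_num))] = external_id
    return mapping

# Made-up sample: two genomes that both contain a sequence number 8460.
sample = io.StringIO(
    "# hypothetical header line\n"
    "branchiostoma_floridae\t8460\tXP_002593948.1 hypothetical protein "
    "BRAFLDRAFT_98245 [Branchiostoma floridae]\n"
    "hypothetical_genome\t8460\thypothetical_protein_A\n"
)
seqnum_map = read_seqnum_map(sample)
print(seqnum_map[("branchiostoma_floridae", 8460)])
```

Because the key includes the genome name, the two entries numbered 8460 do not collide, which is the reason the bare number cannot be used as an OrthoXML gene id.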

Regarding your question about avoiding the external IDs, I'm not sure exactly what you mean. You can always access the prot_id property of a Gene object. If you mean that Python reports the external ID instead of the unique ID when you print a gene, you could override the __repr__ and/or __str__ method of Gene, for example:

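import pyham

def new_repr(self):
    # Show the unique OrthoXML id next to the external protein id.
    return "Gene(id={}, prot_id={})".format(self.unique_id, self.prot_id)

# Replace the default representation for all Gene objects.
pyham.Gene.__repr__ = new_repr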

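and then, going by the output in your post, print(gene_8460) should print Gene(id=8460, prot_id=oki.206.4.t1), while print(test_gene) should print Gene(id=137320, prot_id=XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae])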
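If the goal is a lookup table of OrthoXML gene ids and their external ids, one option that does not depend on PyHam is to read the <gene> elements directly from the OrthoXML file with the Python standard library. The sketch below assumes each <gene> element carries id and protId attributes and sits inside a <species> element with a name attribute. The function names gene_id_table and write_table and the sample XML are hypothetical, so adapt them to your own file.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET

def gene_id_table(orthoxml_source):
    """Collect (orthoxml_id, species, prot_id) for every <gene> element."""
    rows = []
    species = None
    # "start" events are enough: attributes are available when a tag opens.
    for _, elem in ET.iterparse(orthoxml_source, events=("start",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "species":
            species = elem.get("name")
        elif tag == "gene":
            rows.append((elem.get("id"), species, elem.get("protId")))
    return rows

def write_table(rows, handle):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["orthoxml_id", "species", "prot_id"])
    writer.writerows(rows)

# Made-up, minimal OrthoXML fragment for illustration only.
sample = io.BytesIO(b"""<orthoXML xmlns="http://orthoXML.org/2011/">
  <species name="branchiostoma_floridae">
    <database name="hypothetical">
      <genes>
        <gene id="137320" protId="XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]"/>
      </genes>
    </database>
  </species>
</orthoXML>""")

rows = gene_id_table(sample)
write_table(rows, sys.stdout)
```

For a real run, pass the path to HierarchicalGroups.orthoxml instead of the in-memory sample and an open output file instead of sys.stdout. The resulting table lets you look up the short unique id once and then query with get_gene_by_id.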

Best wishes, Adrian