HOG without numbering in the genome of the ancestor reconstructed using the pyHAM
1
0
Entering edit mode
4 months ago

Dear colleagues,

Recently I successfully complete OMA standalone run and now I am trying to analyze results using the pyHam python package. The main task for me is to reconstruct the genome of the ancestor of the analyzed species. After executing the following commands, I get information that in the model of the ancestor genome are 10372 genes:

ham_analysis = pyham.Ham(nwk_file, orthoxml_file, use_internal_name=True)

ancestral_genome = ham_analysis.get_ancestral_genome_by_name(ancestral_genome_name)

ancestral_genes = ancestral_genome.genes

print(len(ancestral_genes))

I am a little alarmed by two facts: 1) In ancestral_genes, many elements (HOG) have this designation <HOG ()>, and do not have a number, in contrast to, for example, <HOG (17899)> or <HOG (17900)>. Is it mistake if previously the sequences assigned to <HOG ()> were included in numbered orthogroups? Or am I missing something? The same designations are found even in the tutorial (https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html) [In [37]].

2) However, if we extract information about the descendant genes in these <HOG ()>, we will see that some of these groups of orthologs contain only 1 or 2 sequences of modern species. Correct me, please, if I am mistaken, but how can it be considered that the ancestor had a gene if it is present only in 1 or 2 species out of, for example, 11 analyzed? Is it worth specifying a certain threshold for the number of modern species in which a gene must be present to reconstruct the ancestor's genome? And how to do it most efficiently and correctly?

I would be grateful for any help!

HOG OMA ancestor pyHAM genome • 228 views
0
Entering edit mode
6 weeks ago

Dear Maksim,

Point #1: Only root HOGs are labelled with ID in the orthoxml. While printing a root HOG will display <HOG (17900)> printing a sub HOG (an HOG at a lower taxonomic range) will display <HOG ()>.

If you need to identifying all HOGs with unique ids then you should use built-in python object id

id(hog_object)


Point #2: Without having the reference species tree I can't help more about this 2/11 species coverage but what I can say is that such scenario is possible. it usually better to think in number of gene loss rather that # of species not covered by the HOG.

Nevertheless, in the OMA parameter file you can change the threshold for merging HOGs; in your case you can see that as as much HOGs need to be complete in term of species coverage to be clustered together.

MinEdgeCompletenessFraction := 0.65;