Question: Map gene gain and loss in phylogenetic tree
3
4.0 years ago by
dago2.6k
Germany
dago2.6k wrote:

I calculated the pangenome for a set of bacterial genomes I am working with. I have a classical pangenome matrix that looks like this:

```
         speciesA  speciesB  ...  speciesZ
gene1    1         2              0
gene2    0         1              0
gene3    1         1              1
gene4    1         1              2
```

Now I would like to map gene gain and loss onto a phylogenetic tree. Something like this:

I guess I could just record all families of genes shared among the different species in each branch of the tree, but I believe there is a more elegant way of doing that.

Any suggestion?

modified 3.9 years ago • written 4.0 years ago by dago2.6k
5
4.0 years ago by
Leo Martins220
Lausanne, Switzerland
Leo Martins220 wrote:

I guess you are looking for software like Count, which performs ancestral reconstruction of gene family sizes over a given tree. But as abascalfederico mentioned, you should be very careful about the possibility of HGT.

Hi, thanks for the answer. Count looks really cool. However, as far as I understand, it calculates rates of gain, loss, and duplication. I would be interested in displaying the absolute number of gene families gained or lost. Any other suggestion?

Count can give the actual ancestral family sizes, not only the rates. I am not sure I understand what you mean by "absolute number of gene families gained or los[t]". Assuming you want the number of events per branch: 1) use Count to get the family sizes at every node for each gene family, then label each branch (or node (*)) as a "gain" (family size goes from zero to non-zero) or a "loss" (from non-zero to zero). For any one family, each branch is therefore a "gain", a "loss", or neither. 2) After doing this for all families on a common tree, go to each branch of that tree and sum the "gains" over families; this gives the number of gains per branch. (The same applies to losses.)

You can also use the other methods suggested by Federico, which may also be available in Count.

(*) For rooted trees there is a one-to-one correspondence between branches and non-root nodes (each node pairs with the branch leading to it), but we usually interpret the events as happening "somewhere" along the branch.
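The per-branch counting described above could be sketched roughly as follows. This is only an illustration: the `parent` and `sizes` dictionaries are a hypothetical input format (not Count's actual output format), and the ancestral family sizes are assumed to have been reconstructed beforehand, e.g. with Count.

```python
from collections import Counter

def count_events(parent, sizes):
    """Count gains and losses per branch.

    parent: child node -> parent node (the root never appears as a key).
    sizes:  family -> {node: family size}, for leaves and ancestral nodes
            (assumed to come from a prior ancestral reconstruction).
    Returns (gains, losses): Counters keyed by child node, i.e. by the
    branch leading to that node.
    """
    gains, losses = Counter(), Counter()
    for family, size in sizes.items():
        for child, par in parent.items():
            before, after = size[par], size[child]
            if before == 0 and after > 0:
                gains[child] += 1      # family appears on this branch
            elif before > 0 and after == 0:
                losses[child] += 1     # family disappears on this branch
    return gains, losses

# Toy rooted tree (hypothetical): root -> (n1 -> (A, B), C)
parent = {"n1": "root", "C": "root", "A": "n1", "B": "n1"}
sizes = {
    "fam1": {"root": 0, "n1": 1, "A": 1, "B": 2, "C": 0},
    "fam2": {"root": 1, "n1": 1, "A": 0, "B": 1, "C": 1},
    "fam3": {"root": 1, "n1": 0, "A": 0, "B": 0, "C": 2},
}

gains, losses = count_events(parent, sizes)
for node in parent:
    print(node, "gains:", gains[node], "losses:", losses[node])
```

Changes in copy number that do not cross zero (for example 1 to 2) are deliberately ignored here, since only the presence or absence of the family is being tracked.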

Thanks very much again for your reply. So basically what you suggest is to go "manually" from one branch (node) to the next and check whether each gene family goes from 0 to > 0 or from > 0 to 0, is that right? For example, from node 26 to 25 Family_1 goes from 0 to 1, so it is a "gain". I wondered if there is a tool which does that.

Count is very cool. I have recently been working on gene family evolution, and when using Count I don't know how to choose the best model (gain and loss, BDI, or something else in the Optimize Rate panel). Could you give me some suggestions?

1
4.0 years ago by
abascalfederico1.1k
Spain
abascalfederico1.1k wrote:

The best way to do this (and the only way I am aware of) is with a phylogenetic tree. Once a phylogenetic tree is reconstructed for each gene family, you can compare that tree with the underlying species tree, which allows you to identify gene duplications and losses. There are tools available to do this. However, horizontal gene transfer is frequent in bacteria, which will make the inference of gains/losses less reliable.

Thanks for the answer. I found a few papers in which they "map" the presence/absence of gene families onto a provided tree using the DELTRAN option in PAUP. Some others use custom scripts. This makes me think that, having a species tree and a pangenome matrix, I should be able to get where I want. How, I do not know yet.

If you just want to know whether a family is present or absent, ignoring gene duplications, I guess you may be able to get something from your matrix. The matrix should be transformed to 0/1 binary characters (absence/presence), and you could identify the origin of the family with Dollo parsimony. Then traverse the tree from the origin of the family and identify nodes all of whose descendants have a 0 character. Each of those would be a loss event.

But this is complicated to implement - the program Leo suggested will probably do the job!
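For what it is worth, the Dollo idea described above can be sketched in a few lines, under simplifying assumptions. Here the family is taken to originate once, at the most recent common ancestor of all species carrying it, and one loss is placed on each maximal clade inside that origin that lacks the family entirely. The tree encoding (`children`), the toy matrix, and the function names are hypothetical, and HGT is ignored.

```python
def leaves_under(children, node):
    """Return the set of leaf names at or below `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves_under(children, kid)
    return found

def dollo_events(children, root, present):
    """Place one gain and the implied losses for a single family.

    children: internal node -> list of child nodes.
    present:  set of leaves (species) that carry the family.
    Returns (gain_node, loss_nodes); events sit on the branch leading
    to each returned node. A gain at the root means the family was
    already present at (or before) the root.
    """
    if not present:
        return None, []
    # Walk down while a single child still contains every carrier.
    node = root
    while True:
        nxt = [k for k in children.get(node, [])
               if present <= leaves_under(children, k)]
        if not nxt:
            break
        node = nxt[0]
    gain, losses = node, []
    stack = list(children.get(gain, []))
    while stack:
        n = stack.pop()
        if leaves_under(children, n) & present:
            stack.extend(children.get(n, []))  # still present below: descend
        else:
            losses.append(n)                   # whole clade lacks the family
    return gain, sorted(losses)

# Toy species tree (hypothetical): root -> (n1 -> (A, B), n2 -> (C, D))
children = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
matrix = {
    "gene1": {"A": 1, "B": 2, "C": 0, "D": 0},
    "gene2": {"A": 1, "B": 0, "C": 1, "D": 1},
    "gene3": {"A": 0, "B": 0, "C": 0, "D": 3},
    "gene4": {"A": 2, "B": 0, "C": 0, "D": 1},
}

events = {}
for gene, row in matrix.items():
    present = {sp for sp, n in row.items() if n > 0}  # binarize counts to 0/1
    events[gene] = dollo_events(children, "root", present)
    print(gene, "gain at:", events[gene][0], "losses at:", events[gene][1])
```

Summing the gain and loss nodes over all families then gives per-branch totals, in the same spirit as the counting approach Leo described.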