7.0 years ago by
You need to realize that there is by definition no way to avoid getting paralogs in the problem you state.
Imagine that you have an ancestral species (A) in which there is one gene (A1) from some family. Through a speciation event, A becomes the species B and C, each of which still has one copy of the gene (B1 and C1). Next, a gene duplication event in C gives two copies of the gene (C1 and C2), which subsequently diverge in function over a long time. It is important to realize here that C1 and C2 are completely equivalent: it may just as well be C2 that retains the ancestral function as C1. Through yet another speciation event, species C becomes D and E, which each have two paralogous genes, D1/D2 and E1/E2.
You now find yourself in the situation that you have three extant organisms, B, D, and E, with the genes B1, D1, D2, E1, and E2. If you trace their origin, they all derive from A1, and each of D1, D2, E1, and E2 is separated from B1 by the same speciation event (the split of A into B and C). D1, D2, E1, and E2 are thus all orthologs of B1, despite D1 and D2 being paralogs of each other.
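To make the example concrete, here is a minimal Python sketch that encodes this gene tree by hand and classifies each pair of extant genes by the event at their last common ancestor (speciation means orthologs, duplication means paralogs). The node name `C_pre` (the single gene in C before the duplication) and the function names are my own labels for illustration, and I am assuming that D1/E1 descend from C1 and D2/E2 from C2.

```python
from itertools import combinations

# Hand-coded gene tree for the example: child -> parent.
# "C_pre" is a hypothetical label for the gene in species C before it duplicated.
PARENT = {
    "B1": "A1", "C_pre": "A1",   # speciation of A into B and C
    "C1": "C_pre", "C2": "C_pre",  # duplication within C
    "D1": "C1", "E1": "C1",      # speciation of C into D and E (C1 lineage)
    "D2": "C2", "E2": "C2",      # speciation of C into D and E (C2 lineage)
}

# Event that happened at each internal node of the gene tree.
EVENT = {
    "A1": "speciation",
    "C_pre": "duplication",
    "C1": "speciation",
    "C2": "speciation",
}

EXTANT = ["B1", "D1", "D2", "E1", "E2"]


def ancestors(gene):
    """Return the path from a gene up to the root, starting with the gene itself."""
    path = [gene]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def last_common_ancestor(a, b):
    """Return the most recent node shared by the lineages of a and b."""
    seen = set(ancestors(b))
    return next(node for node in ancestors(a) if node in seen)


def relation(a, b):
    """Orthologs if the genes diverged at a speciation, paralogs if at a duplication."""
    event = EVENT[last_common_ancestor(a, b)]
    return "orthologs" if event == "speciation" else "paralogs"


for a, b in combinations(EXTANT, 2):
    print(f"{a} vs {b}: {relation(a, b)}")
```

Every pair involving B1 comes out as orthologs, while D1 vs D2, E1 vs E2, D1 vs E2, and D2 vs E1 come out as paralogs. There is no subset of D1, D2, E1, E2 that contains all of B1's orthologs and no paralogous pair.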
There is no way to "fix" this, because it is not an error. It is the reality. The gene B1 does not have a one-to-one ortholog in D or E. The gene E2 is both an ortholog of B1 and a paralog of D1! If you were to remove it to avoid paralogs, you would be removing one of the correct orthologs. And there is no guarantee that E1 is the copy that kept the ancestral function while E2 took on a different one. It could just as well be the other way around. The ancestral function may even have been divided between the two copies, so that each does part of it (this is known as subfunctionalization).
Since you want to have orthologs across all of bacteria, you need orthologous groups defined with respect to the last common ancestor of all bacteria. All subsequent gene duplications will lead to genes that - like in the example above - are at the same time paralogs of some genes and orthologs of others. You simply have to accept that there will be paralogs in your set. It is not a flaw of the COG database. Paralogs are an unavoidable logical consequence of dealing with orthology across multiple species.