Question: Phylogenetic Distance From Incomplete Dataset
2
8.9 years ago by
Alf450
UK
Alf450 wrote:

I have a set of organisms, and I want an approximation of the distance matrix between them (I don't need the tree). My plan was to take several COGs and concatenate them into a single sequence per organism, which, after alignment, would give me the p-distance (or any other distance derived from it).

The problem is that the COG dataset I am using has no "universal COG". That is to say, every COG leaves out a few organisms. One option is to drop the organisms that are missing from at least one COG and work with the rest. Another option, which I've just thought of, is to build a distance matrix for each COG separately and then compute the average of the n individual matrices. If a distance is undefined in, for example, one of the matrices, I would add 0.0 to the sum and divide by n-1 instead of by n.
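
A minimal sketch of the averaging idea, under a few assumptions that are not in the post: each per-COG matrix is a square list of lists over the same organism order, undefined distances are stored as NaN, and the function name `average_distance_matrices` is made up for illustration. Rather than hard-coding n-1, it divides each cell by the number of COGs in which that pair is actually defined, which reduces to n-1 when exactly one matrix lacks the value.

```python
import math


def average_distance_matrices(matrices):
    """Average square distance matrices cell by cell, skipping NaN entries.

    Returns (mean, counts): mean[i][j] is NaN when no matrix defines the
    pair, and counts[i][j] is the number of matrices that did define it.
    """
    n = len(matrices[0])
    mean = [[math.nan] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Keep only the matrices in which this pair has a distance.
            values = [m[i][j] for m in matrices if not math.isnan(m[i][j])]
            counts[i][j] = len(values)
            if values:
                mean[i][j] = sum(values) / len(values)
    return mean, counts


# Toy example (invented numbers): organism 2 is absent from the second COG.
nan = math.nan
cog_a = [[0.0, 0.2, 0.4],
         [0.2, 0.0, 0.6],
         [0.4, 0.6, 0.0]]
cog_b = [[0.0, 0.4, nan],
         [0.4, 0.0, nan],
         [nan, nan, nan]]
mean, counts = average_distance_matrices([cog_a, cog_b])
```

The returned counts show how many COGs support each averaged distance, which may help flag pairs that rest on very few genes.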

I think the solution is very naive, but I still have some questions. Do you think this approach is trustworthy? Is it standard practice (if so, could you give a reference where it is used)? Can you suggest an alternative? Note that I want an approximation of the distance, not a perfectly computed tree.

phylogenetics distance • 1.9k views
1

What kind of organisms do you have? How closely related are they?

1

I'm assuming you mean the three domains (Eukaryotes, Eubacteria, and Archaea)? If so, using COGs may be problematic, as COG is built around bacterial representation. You could use OrthoMCL definitions, but they are split on a finer scale than you may want when it comes to co-orthologs, inparalogs, etc. If you meant three kingdoms, as in Plants, Metazoa, and Fungi, they aren't THAT unrelated. I'd recommend Homologene if that is the case.

Many organisms (>300, including all classical model organisms) from the three kingdoms (and therefore very unrelated).

I mean domains, yes. In Ciccarelli et al., "Toward automatic reconstruction of a highly resolved tree of life", Science. 2006 May 5;312(5774):697, they use a set of 31 COGs (not KOGs) to build a "universal" tree of life for around 100 species. The tree seems to be very accurate compared with previous findings and beliefs, and it's a highly cited paper (>500 citations). I basically want to do the same, though it doesn't need to be as accurate (a rough approximation should be good enough). My problem is that no single COG covers all the species in my dataset, hence my initial question :)...

Michael, basically all STRING core species :).

The lack of a universal COG isn't a problem; it is dealt with all the time in phylogenomic analyses.

1
8.9 years ago by
DG7.1k
DG7.1k wrote:

I think the solution you propose is reasonable, as it is an approximation. It's really not that different from how joint estimation of branch lengths is done with full trees on concatenated alignments/supermatrices where you have missing data.

Depending on the distance metric you want to use, it may already handle missing data if you concatenate all the sequences together, using gap characters where a taxon doesn't have a gene as part of that COG.
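
To illustrate the concatenation route, here is a rough sketch. It assumes that each COG alignment is a dict mapping taxon name to its aligned sequence, that missing taxa are padded with `-`, and that the p-distance uses pairwise deletion (columns where either sequence has a gap are skipped). The function names and toy sequences are invented for the example, and real distance tools may treat gaps differently.

```python
def concatenate_with_gaps(alignments, taxa):
    """Build one concatenated sequence per taxon from several alignments.

    A taxon that is absent from an alignment gets a run of '-' of that
    alignment's length, so all concatenated sequences stay the same length.
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, "-" * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def p_distance(seq1, seq2, gap="-"):
    """Proportion of differing sites, ignoring columns with a gap in either
    sequence. Returns None when the two sequences share no comparable site."""
    compared = 0
    differing = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else None


# Toy data: taxon C has no member in the second COG.
cog1 = {"A": "MKV-L", "B": "MKVAL", "C": "MRVAL"}
cog2 = {"A": "GHT", "B": "GYT"}
supermatrix = concatenate_with_gaps([cog1, cog2], ["A", "B", "C"])
d_ac = p_distance(supermatrix["A"], supermatrix["C"])
```

With pairwise deletion, the A-C distance here is computed only over the columns of the first COG, since C contributes nothing but gaps for the second one.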

1
8.9 years ago by
Lyco2.3k
Germany
Lyco2.3k wrote:

Is there a particular reason why you want to use COGs? Usually, when people want to make species trees, they focus on ribosomal RNAs, which are present in all organisms and are clearly related and alignable, even over large evolutionary distances. Actually, there are quite a few rRNA databases that mainly serve this purpose (http://www.arb-silva.de or http://rdp.cme.msu.edu).

There are situations where using protein is appropriate. If you strictly need a protein-based distance matrix, you should focus on proteins that are found everywhere, e.g. ribosomal core subunits.

1

Why would it be difficult to get the rRNA sequence, or why should it be harder to get the rRNA than to get a protein sequence? rRNAs are by far the easiest message to detect. Or are you talking about metagenomics data? But then your multi-COG approach would also be impossible.

1

I'm not sure why you would have a protein fasta file where you don't know what species the sequences belong to. If it is public data and the species just happens to be missing from the file you obtained or were given, a simple BLASTP search to find the identical record in NCBI is trivial and would give you the taxonomic assignment.

As for Lyco's question of why to use COGs in the first place versus the rRNA distance, COGs seem a natural choice for ortholog clusters in bacteria, and doing a distance based on a concatenated set of COGs would give you a more robust distance in a more phylogenomic context.

1

Generate clusters of homologous sequences, pre-build trees using something like FastTree or RAxML, and go from there. Either way, you're going to have to do some sort of profile-based searching to figure out which clusters the user-supplied sequences match; if you don't want to make trees, calculate p-distances just for those matches and average them.

1

If you're adding a new species to a set of reference species, I'd take an existing, more exact tree (like the Ciccarelli tree), figure out the closest neighbor in this tree, and use that as a proxy. Building trees is hard; see, e.g., this nice review: http://www.biology-direct.com/content/6/1/32

The idea is to add new species afterwards, in many cases unknown ones (just by adding the fasta file). So getting the rRNA is not easy, but I can assign a COG if there is a mutual best match in the COG database (again, approximately)... There will probably be no good candidate for some of the COGs, so the new guy would only have distances in some of the matrices...
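
The mutual-best-match assignment could be sketched roughly as follows. The assumptions: similarity scores (for instance bit scores from a search tool) are already available in both directions as nested dicts, higher is better, and all identifiers below are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict of query -> dict of target -> score (higher is better).
    Queries with no hits are left out.
    """
    return {query: max(hits, key=hits.get)
            for query, hits in scores.items() if hits}


def mutual_best_matches(forward, reverse):
    """Keep only the pairs that are each other's best hit.

    forward: new proteins searched against the reference COG members.
    reverse: reference COG members searched against the new proteins.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {query: target for query, target in fwd.items()
            if rev.get(target) == query}


# Toy scores (invented): new_p3 hits cogB_ref1, but cogB_ref1 prefers new_p2.
forward = {
    "new_p1": {"cogA_ref1": 310.0, "cogB_ref1": 95.0},
    "new_p2": {"cogB_ref1": 120.0},
    "new_p3": {"cogB_ref1": 80.0},
}
reverse = {
    "cogA_ref1": {"new_p1": 305.0},
    "cogB_ref1": {"new_p1": 90.0, "new_p2": 118.0, "new_p3": 79.0},
}
assigned = mutual_best_matches(forward, reverse)
```

Proteins that end up without a mutual best match (like `new_p3` here) simply contribute no distance for that COG, which is the situation described above where the new species only appears in some of the matrices.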

It's not metagenomics, for sure. The scenario is the following: imagine I have a protein fasta file, and I don't know which species it belongs to. Maybe I am missing something, but how do I get the rRNA? Sorry for the newbie questions :)