I have a set of organisms, and I want an approximation of the distance matrix between them (I don't need the tree). My plan was to take several COGs and concatenate them into a single sequence per organism, which, after alignment, would give me the p-distance (or any other distance derived from it).
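To be concrete about what I mean by p-distance, here is a minimal sketch in Python. The function name `p_distance` is just illustrative, and skipping every column where either sequence has a gap is an assumption on my part (one common convention, not the only one).

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped.
    Returns None when no comparable columns remain.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else None


# One mismatch over four comparable columns.
print(p_distance("ACGT", "ACGA"))  # 0.25
```

For the concatenated approach, `seq_a` and `seq_b` would be the concatenated, aligned COG sequences of two organisms.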
The problem is that the COG dataset I am using has no "universal COG": every COG is missing a few organisms. One option is to discard any organism that is missing from at least one COG and work with the rest. Another option, which I have just thought of, is to build a distance matrix for each COG separately and then average the n individual matrices. Obviously, if a distance is not defined in one of the matrices, I would add 0.0 to the sum for that pair and divide by n-1 instead of n (in general, by the number of matrices in which that pair is defined).
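The averaging I have in mind could be sketched roughly as follows. This is only an illustration: the function name `average_distance_matrices` and the representation of each per-COG matrix as a dict keyed by `frozenset({org_a, org_b})` are my own assumptions, chosen so that a missing pair simply means that COG lacks one of the two organisms.

```python
def average_distance_matrices(matrices, organisms):
    """Average per-COG distances pair by pair, skipping undefined entries.

    `matrices` is a list of dicts mapping frozenset({org_a, org_b}) to a
    distance. A pair absent from a dict is treated as undefined for that COG.
    Pairs that are undefined in every COG get None.
    """
    averaged = {}
    for i, a in enumerate(organisms):
        for b in organisms[i + 1:]:
            pair = frozenset((a, b))
            # Collect only the COGs in which this pair has a distance.
            values = [m[pair] for m in matrices if pair in m]
            averaged[pair] = sum(values) / len(values) if values else None
    return averaged


# Hypothetical toy data: organism C is missing from the second COG.
cog1 = {frozenset(("A", "B")): 0.2,
        frozenset(("A", "C")): 0.4,
        frozenset(("B", "C")): 0.6}
cog2 = {frozenset(("A", "B")): 0.4}
avg = average_distance_matrices([cog1, cog2], ["A", "B", "C"])
```

Here the A-B distance is averaged over both COGs, while A-C and B-C are averaged over the single COG in which they are defined.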
I think the solution is very naive, but I still have some questions. Is this approach reliable? Is it standard practice (and if so, could you point me to a reference where it is used)? Would you suggest an alternative? Note that I want an approximation of the distances, not a perfectly computed tree.