Getting Pairwise Sequence Distances From The Muscle Command-Line
2
2
Entering edit mode
11.1 years ago
Will 4.5k

I'm trying to automate (and improve) a computational 'procedure' in my new lab (of traditional biologists) and I'm having trouble replicating a step.

At one point in an analysis they use the GUI version of MUSCLE to align a set of sequences and then generate a UPGMA tree. Then they save the pairwise distance matrix for downstream analysis. I've already automated the downstream and upstream processes but I'm having trouble with this step.

The MUSCLE command line doesn't have an option for returning the pairwise distances (only the final tree). Is there a way to get those distances out? Or is there a way to extract the distances from the tree (a simple googleing didn't reveal anything)? Or is there a way to calculate those distances myself (the muscle manual doesn't give much help explaining how it arrives at the distances)? Or any other suggestions? I really don't want to resort to some silly 'automated GUI interaction' just to get this output data.

Thanks,

Will

alignment • 6.7k views
2
Entering edit mode
11.1 years ago
Will 4.5k

In case anyone else stumbles across this later, here's the answer I came up with:

I used the Biopython toolbox to read the tree-file created by the -tree2 option and then the return the branch-lengths between all pairs of terminal nodes:

from Bio import Phylo
from itertools import combinations

tree = Phylo.readopentree_handle.name), 'newick')
seq_names = tree.get_terminals()
dmat = {}
for p1, p2 in combinations(seq_names, 2):
d = tree.distance(p1, p2)
dmat[p1.name, p2.name)] = d
dmat[p2.name, p1.name)] = d

0
Entering edit mode
11.1 years ago
Josh Herr 5.8k

I would think there would be a few ways to do this. I don't have a clear answer, but here's a couple of options:

1. Could you parse a FASTA alignment out from MUSCLE to RAxML (github)? In MUSCLE you can parse out a fasta file with the -fasta flag. You'd be able to easily compute a distance matrix from there. I'm not sure how you are writing your pipeline (bash, etc.) but I would think this might be fairly straightforward and you could send it to other downstream analyses as that is your end goal.

2. I'm not too up to speed with sequence data in R, but if you can output (the output flags in MUSCLE are -clw, -clwstrict, -html, and -msf) in a format that can be converted to a matrix, this can be done in R with the dist( ) command and then sent back into your downstream analysis pipeline.

3. I'm sure there are other solutions... If you've exhausted them all, you can always contact Robert Edgar. He's been great when dealing with my questions and has always been extremely prompt at his replies to any questions about his programs (UCLUST is where most of my questions have been centered).

Let us know what you figure out!

0
Entering edit mode

Looks like you beat me to it! Thanks for posting your solution!

0
Entering edit mode

None of those would work for this use-case ... I needed the matrix of distances between sequences not the 'alignment matrix'. None of the output methods include distances except the -tree. And that requires some conversion to get the values out.

0
Entering edit mode

Yes, I understand that there is no distance matrix output for MUSCLE so my response was to suggest two ways to compute a distance matrix. Thanks for your Python solution!