Question: Getting Pairwise Sequence Distances From The Muscle Command-Line
2
6.3 years ago by
Will4.5k
United States
Will4.5k wrote:

I'm trying to automate (and improve) a computational 'procedure' in my new lab (of traditional biologists) and I'm having trouble replicating a step.

At one point in an analysis they use the GUI version of MUSCLE to align a set of sequences and then generate a UPGMA tree. Then they save the pairwise distance matrix for downstream analysis. I've already automated the downstream and upstream processes but I'm having trouble with this step.

The MUSCLE command line doesn't have an option for returning the pairwise distances (only the final tree). Is there a way to get those distances out? Or is there a way to extract the distances from the tree (a simple googleing didn't reveal anything)? Or is there a way to calculate those distances myself (the muscle manual doesn't give much help explaining how it arrives at the distances)? Or any other suggestions? I really don't want to resort to some silly 'automated GUI interaction' just to get this output data.

Thanks,

Will

alignment • 4.0k views
written 6.3 years ago by Will4.5k
2
6.3 years ago by
Will4.5k
United States
Will4.5k wrote:

In case anyone else stumbles across this later, here's the answer I came up with:

I used the Biopython toolbox to read the tree-file created by the `-tree2` option and then the return the branch-lengths between all pairs of terminal nodes:

``````from Bio import Phylo
from itertools import combinations

seq_names = tree.get_terminals()
dmat = {}
for p1, p2 in combinations(seq_names, 2):
d = tree.distance(p1, p2)
dmat[p1.name, p2.name)] = d
dmat[p2.name, p1.name)] = d
``````
0
6.3 years ago by
Josh Herr5.6k
Josh Herr5.6k wrote:

I would think there would be a few ways to do this. I don't have a clear answer, but here's a couple of options:

1. Could you parse a FASTA alignment out from MUSCLE to RAxML (github)? In MUSCLE you can parse out a fasta file with the `-fasta` flag. You'd be able to easily compute a distance matrix from there. I'm not sure how you are writing your pipeline (bash, etc.) but I would think this might be fairly straightforward and you could send it to other downstream analyses as that is your end goal.

2. I'm not too up to speed with sequence data in R, but if you can output (the output flags in MUSCLE are `-clw`, `-clwstrict`, `-html`, and `-msf`) in a format that can be converted to a matrix, this can be done in R with the `dist( )` command and then sent back into your downstream analysis pipeline.

3. I'm sure there are other solutions... If you've exhausted them all, you can always contact Robert Edgar. He's been great when dealing with my questions and has always been extremely prompt at his replies to any questions about his programs (UCLUST is where most of my questions have been centered).

Let us know what you figure out!

Looks like you beat me to it! Thanks for posting your solution!

None of those would work for this use-case ... I needed the matrix of distances between sequences not the 'alignment matrix'. None of the output methods include distances except the `-tree`. And that requires some conversion to get the values out.