Question: Getting Pairwise Sequence Distances From The Muscle Command-Line
2
gravatar for Will
6.7 years ago by
Will4.5k
United States
Will4.5k wrote:

I'm trying to automate (and improve) a computational 'procedure' in my new lab (of traditional biologists) and I'm having trouble replicating a step.

At one point in an analysis they use the GUI version of MUSCLE to align a set of sequences and then generate a UPGMA tree. Then they save the pairwise distance matrix for downstream analysis. I've already automated the downstream and upstream processes but I'm having trouble with this step.

The MUSCLE command line doesn't have an option for returning the pairwise distances (only the final tree). Is there a way to get those distances out? Or is there a way to extract the distances from the tree (a simple googleing didn't reveal anything)? Or is there a way to calculate those distances myself (the muscle manual doesn't give much help explaining how it arrives at the distances)? Or any other suggestions? I really don't want to resort to some silly 'automated GUI interaction' just to get this output data.

Thanks,

Will

alignment • 4.3k views
ADD COMMENTlink written 6.7 years ago by Will4.5k
2
gravatar for Will
6.7 years ago by
Will4.5k
United States
Will4.5k wrote:

In case anyone else stumbles across this later, here's the answer I came up with:

I used the Biopython toolbox to read the tree-file created by the -tree2 option and then the return the branch-lengths between all pairs of terminal nodes:

from Bio import Phylo
from itertools import combinations

tree = Phylo.readopentree_handle.name), 'newick')
seq_names = tree.get_terminals()
dmat = {}
for p1, p2 in combinations(seq_names, 2):
    d = tree.distance(p1, p2)
    dmat[p1.name, p2.name)] = d
    dmat[p2.name, p1.name)] = d
ADD COMMENTlink written 6.7 years ago by Will4.5k
0
gravatar for Josh Herr
6.7 years ago by
Josh Herr5.6k
University of Nebraska
Josh Herr5.6k wrote:

I would think there would be a few ways to do this. I don't have a clear answer, but here's a couple of options:

  1. Could you parse a FASTA alignment out from MUSCLE to RAxML (github)? In MUSCLE you can parse out a fasta file with the -fasta flag. You'd be able to easily compute a distance matrix from there. I'm not sure how you are writing your pipeline (bash, etc.) but I would think this might be fairly straightforward and you could send it to other downstream analyses as that is your end goal.

  2. I'm not too up to speed with sequence data in R, but if you can output (the output flags in MUSCLE are -clw, -clwstrict, -html, and -msf) in a format that can be converted to a matrix, this can be done in R with the dist( ) command and then sent back into your downstream analysis pipeline.

  3. I'm sure there are other solutions... If you've exhausted them all, you can always contact Robert Edgar. He's been great when dealing with my questions and has always been extremely prompt at his replies to any questions about his programs (UCLUST is where most of my questions have been centered).

Let us know what you figure out!

ADD COMMENTlink written 6.7 years ago by Josh Herr5.6k

Looks like you beat me to it! Thanks for posting your solution!

ADD REPLYlink written 6.7 years ago by Josh Herr5.6k

None of those would work for this use-case ... I needed the matrix of distances between sequences not the 'alignment matrix'. None of the output methods include distances except the -tree. And that requires some conversion to get the values out.

ADD REPLYlink written 6.7 years ago by Will4.5k

Yes, I understand that there is no distance matrix output for MUSCLE so my response was to suggest two ways to compute a distance matrix. Thanks for your Python solution!

ADD REPLYlink written 6.7 years ago by Josh Herr5.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1777 users visited in the last hour