Question

Use NCBI guide tree in Clustalw

8.8 years ago

cara78 ▴ 10

I want to use a guide tree that I got from the NCBI Common Taxonomy Tree tool with my MSA in clustalw, to produce an alternative arrangement of the MSA based on the NCBI guide tree.

The guide tree I got from NCBI looks like this:

(
'synthetic construct':4,
'unclassified sequences':4,
'Paramecium bursaria Chlorella virus 1':4,
(
(
(
'Picomonas judraskeda':4,
'Palpitomonas bilix':4,
'Metromonas simplex':4
)'unclassified eukaryotes':4,
(
'Rhizomastix libera':4,
(
'Stygamoeba regulata':4,
..
..
..

I tried using it, but I get the error "ERROR: tree" with no other information as to what may be wrong.

The command I used is:

clustalw2 -INFILE=names.fasta -USETREE=ncbi_guide_tree.phy -OUTFILE=unique_euka.fasta -OUTPUT=FASTA

Does anyone have any suggestions as to what may be wrong?

ncbi clustalw MSA guide-tree • 2.7k views


Ram · Answer 1 · 2015-06-30

8.8 years ago

blackgore ▴ 60

Without seeing the input sequence data or the full tree file, this is just a guess, but since your data is coming from two sources, it's possible that there's a mix-up in the naming - clustalw may not be able to find your input sequences in your guide tree. Do the sequence names in the guide tree exactly match the sequence names in your input file? Does your guide tree contain many more names than there are in the input sequence file?
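One way to check both questions is to pull the names out of each file and compare them as sets. The following is only a rough sketch: the function names are made up for illustration, it assumes each FASTA header line is exactly the taxon name, and it uses a simple regular expression rather than a full Newick parser, so labels with commas or parentheses inside the quotes may confuse it.

```python
import re

# A leaf label is a quoted or bare name that directly follows "(" or ",".
# Labels that follow ")" are internal node names and are skipped on purpose.
LEAF_RE = re.compile(r"[(,]\s*(?:'([^']*)'|([^\s'(),:;]+))")


def fasta_names(fasta_text):
    """Return each FASTA header line without the leading '>'."""
    return [line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")]


def newick_leaf_names(newick_text):
    """Return the leaf labels found in a Newick string (quotes removed)."""
    return [quoted or bare for quoted, bare in LEAF_RE.findall(newick_text)]


def compare_names(fasta_text, newick_text):
    """Return (names only in the FASTA file, names only in the tree)."""
    fasta = set(fasta_names(fasta_text))
    tree = set(newick_leaf_names(newick_text))
    return sorted(fasta - tree), sorted(tree - fasta)


# Small inline example (hypothetical sequences, names taken from the tree above).
example_fasta = ">synthetic construct\nACGT\n>Picomonas judraskeda\nACGT\n"
example_tree = ("('synthetic construct':4,"
                "(('Picomonas judraskeda':4,'Palpitomonas bilix':4)"
                "'unclassified eukaryotes':4));")
only_fasta, only_tree = compare_names(example_fasta, example_tree)
print("Only in FASTA:", only_fasta)
print("Only in tree:", only_tree)
```

To run it on real data, read the contents of names.fasta and ncbi_guide_tree.phy into strings and pass them to `compare_names`. Two empty lists would mean the leaf names and the FASTA headers agree exactly.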


Yes, I checked the names and they are the same. The guide tree contains the same names as the input sequence file. I used their taxon IDs to get the guide tree from NCBI.

Reply • 8.8 years ago by cara78 ▴ 10

OK. One thing I noticed in your tree file: the sequence names contain spaces, which clustalw2 doesn't like. http://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#11

When reading the input, clustalw2 will likely interpret "unclassified eukaryotes" and "unclassified sequences" as duplicate entries named "unclassified". If your sequence names do match exactly in both files, then I'd recommend changing the name of each sequence (in both files) so that it contains no spaces, e.g. "Picomonas judraskeda" > "Picomonas_judraskeda".
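A small script can show which names would collide if everything after the first space were dropped, and can apply the underscore renaming to both files in the same way. This is a minimal sketch with hypothetical function names. Whether clustalw2 accepts quoted labels in the tree at all is not something this thread confirms, so the `strip_quotes` option is only an assumption to experiment with.

```python
import re
from collections import defaultdict


def first_token_collisions(names):
    """Group names that share the same first whitespace-delimited token."""
    groups = defaultdict(list)
    for name in names:
        if name.strip():
            groups[name.split()[0]].append(name)
    return {token: group for token, group in groups.items() if len(group) > 1}


def underscore_fasta_headers(fasta_text):
    """Replace runs of whitespace in FASTA header lines with '_'."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">" + "_".join(line[1:].split())
        out.append(line)
    return "\n".join(out) + "\n"


def underscore_newick_labels(newick_text, strip_quotes=False):
    """Replace whitespace inside single-quoted Newick labels with '_'."""
    def fix(match):
        label = "_".join(match.group(1).split())
        # Assumption: dropping the quotes may or may not be needed for clustalw2.
        return label if strip_quotes else "'" + label + "'"
    return re.sub(r"'([^']*)'", fix, newick_text)


names = ["unclassified eukaryotes", "unclassified sequences", "Picomonas judraskeda"]
print(first_token_collisions(names))
print(underscore_newick_labels("('Picomonas judraskeda':4,'Metromonas simplex':4);"))
```

Applying the same whitespace rule to both files keeps the two sets of names identical after renaming, which is what the guide tree lookup depends on.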

Incidentally, there's also a practical limit of 30 characters for sequence names that you may want to consider: http://www.ebi.ac.uk/Tools/msa/clustalw2/help/faq.html#18. The clustalw output may truncate your sequence names, so that (for example) "Paramecium bursaria Chlorella virus 1" just becomes "Paramecium bursaria Chlorella", which may or may not have an impact on your output.
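To see in advance whether truncation could matter, you can list the names that exceed the limit and check whether any two become identical once cut. A minimal sketch, taking the 30-character figure above as an assumption rather than a verified constant; the function name is hypothetical.

```python
from collections import defaultdict

MAX_NAME_LEN = 30  # assumed practical limit, taken from the FAQ link above


def truncation_report(names, limit=MAX_NAME_LEN):
    """Return (names longer than limit, groups that clash after truncation)."""
    too_long = [name for name in names if len(name) > limit]
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    clashes = {cut: group for cut, group in groups.items() if len(group) > 1}
    return too_long, clashes


names = ["Paramecium_bursaria_Chlorella_virus_1", "Picomonas_judraskeda"]
too_long, clashes = truncation_report(names)
print("Longer than limit:", too_long)
print("Clashes after truncation:", clashes)
```

A long name on its own is usually harmless. The case to look for is a non-empty clash group, where two different sequences would end up with the same truncated name.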

Reply • updated 16 months ago by Ram 43k • written 8.8 years ago by blackgore ▴ 60