ClustalO .phylip output is Broken?
1
17 months ago
NickJD • 0

Hi,

I am trying to produce a .phylip file output using the ClustalO/X systems. I have a test set of 7 viral genomes (30kb) and I have run them through both the default options of ClustalO and ClustalX.

The ClustalX software seems to produce a phylip file which works with other software such as FastTree.

However, ClustalO produces a slightly different output. The output from ClustalO does not work in FastTree and I get the following error:

No sequence in phylip line TCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCAC

As you can see from the two files, there are some minor differences which could be causing the problem.

ClustalO:

 MT084071.1--------------------------------------------------

TCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCAC


ClustalX

      MT084071.1 ---------- ---------- ---------- ---------- ----------

TCTTGTAGAT CTGTTCTCTA AACGAACTTT AAAATCTGTG TGGCTGTCAC


Is this a bug, or is there a way to get ClustalO to produce 'correctly' formatted .phylip files, as ClustalX does?

Many Thanks

ClustalO phylip clustal multiplealignment • 623 views
1

Can you double check your post? I'm not convinced your formatting is correct and representative of the actual files.

All versions of Clustal should produce compatible Phylip files AFAIK.

0

Hi, I have checked the input and the output. I know this seems silly to think the output would be different but it is.

The output snippets I uploaded are from the final line of the seq IDs. As you can see, the consistent difference is that ClustalO leaves no gap between the ID and the '---', whereas ClustalX has spacing within the sequence lines.

ClustalO

ClustalX

I know it seems crazy but the outputs are different.

1

I've edited your post to fix the images, please double check I got them the right way around.

Based on those, ClustalO is not outputting a valid phylip. The spacing in the ClustalX version is correct, but I can't say that I've ever experienced an issue with ClustalO, and indeed it's the newer and recommended tool.

I can partially recreate this, as when I run clustalo, it produces the 'unbroken' sequences, but does respect the space between ID and sequence start (though this may be because my test IDs are shorter than yours).

Can you share what version of Clustal this pertains to for each?

0

Yes you fixed it. Thank you.

I used the newest version listed here for Linux: http://www.clustal.org/omega/ (1.2.4). I also used the version from apt install clustalo, which is also listed as 1.2.4.

I think the ID spacing is indeed down to length, I can shorten my ID lengths to fix that. It does seem to be the gaps in the sequence which are needed.

Forgot to add: ClustalX is the version from apt install, which is 2.1.

Thanks again.

0

I am sure it is a problem with how ClustalO formats its Phylip output. ClustalW produces the same correct output as ClustalX.

2
17 months ago
h.mon 33k

ClustalO is implementing the "strict" phylip format, described on the Phylip documentation page, which states:

Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code.

As far as I know, most (or at least several) programs implement a "relaxed" phylip format. One of the most common liberties taken is that the "species name" (the sequence identifier) is not restricted to exactly ten characters, and a space separates the sequence identifier from the sequence data. It seems ClustalX also implements one of the many relaxed phylip formats.

The scikit documentation has a good page on the phylip format.
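To make the strict/relaxed distinction concrete, here is a minimal Python sketch that reads a strict, interleaved PHYLIP alignment (names occupying exactly the first ten columns, sequence data following immediately) and rewrites it in a relaxed, sequential layout with a space after each name. The function names `parse_strict_phylip` and `to_relaxed_phylip` are made up for this illustration, and the sketch assumes the file has a standard header line and that names appear only in the first block. Whether a given downstream tool accepts the relaxed output still needs to be checked against that tool.

```python
def parse_strict_phylip(text):
    """Parse strict interleaved PHYLIP text into a list of (name, sequence) pairs.

    Assumes names fill exactly the first 10 columns of the first block and
    that later blocks contain sequence data only.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_sites = (int(x) for x in lines[0].split())
    names, chunks = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seqs:
            # First block: fixed-width name, then sequence with no separator.
            names.append(line[:10].strip())
            chunks.append(["".join(line[10:].split())])
        else:
            # Later blocks: sequence only, cycling through the taxa in order.
            chunks[i % n_seqs].append("".join(line.split()))
    records = [(name, "".join(parts)) for name, parts in zip(names, chunks)]
    for name, seq in records:
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return records


def to_relaxed_phylip(records):
    """Write records as relaxed sequential PHYLIP: padded name, a space, full sequence."""
    width = max(len(name) for name, _ in records)
    header = f"{len(records)} {len(records[0][1])}"
    rows = [f"{name.ljust(width)} {seq}" for name, seq in records]
    return "\n".join([header] + rows) + "\n"


if __name__ == "__main__":
    strict = "2 12\nMT084071.1ACGT--\nseqB      TTGTAA\n\nGGCCAA\nCCGGTT\n"
    print(to_relaxed_phylip(parse_strict_phylip(strict)), end="")
```

Note how the ten-character ID `MT084071.1` leaves no room for padding in the strict layout, which is why it runs straight into the sequence data.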

0

Thank you for this.

I found that if I get ClustalO to output the MSA in FASTA format, FastTree and other software understand it, and I can still get the trees to build.
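For anyone wanting to script that workaround, a rough Python sketch follows. It assumes `clustalo` and `FastTree` are on the PATH (some packages install the latter as `fasttree`), and the file names are placeholders. The helper names are invented for this example. `--outfmt=fa` asks Clustal Omega for aligned FASTA, `--force` overwrites an existing output file, and `-nt` tells FastTree the alignment is nucleotide.

```python
import subprocess


def clustalo_fasta_cmd(unaligned, aligned):
    """Build a Clustal Omega command that writes the alignment as FASTA."""
    return ["clustalo", "-i", unaligned, "-o", aligned, "--outfmt=fa", "--force"]


def fasttree_cmd(aligned, binary="FastTree"):
    """Build a FastTree command for a nucleotide alignment (tree goes to stdout)."""
    return [binary, "-nt", aligned]


def build_tree(unaligned, aligned, tree_path):
    """Align with Clustal Omega, then infer a tree with FastTree."""
    subprocess.run(clustalo_fasta_cmd(unaligned, aligned), check=True)
    with open(tree_path, "w") as out:
        # FastTree prints the Newick tree on stdout, so redirect it to a file.
        subprocess.run(fasttree_cmd(aligned), stdout=out, check=True)


if __name__ == "__main__":
    # Placeholder file names; replace with your own.
    print(" ".join(clustalo_fasta_cmd("genomes.fasta", "aligned.fasta")))
    print(" ".join(fasttree_cmd("aligned.fasta")))
```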

What use are standards if we do not keep to them?

2

Phylip was written a long time ago (around 1986 if I am correct). There were not many sequences around at that time, so 10 was perhaps deemed a reasonable field length. In the absence of WYSIWYG editors, there were requirements for blanks in fields, etc.

The FASTA format (not to be confused with the aligner of the same name), the most widely used in bioinformatics, has no formal format definition. But it continues to be used to this day.

0

What use are standards if we do not keep to them?

Amen.
