Question: ClustalO .phylip output is Broken?
NickJD wrote:

Hi,

I am trying to produce a .phylip file output using the ClustalO/X systems. I have a test set of 7 viral genomes (30kb) and I have run them through both the default options of ClustalO and ClustalX.

The ClustalX software seems to produce a phylip file which works with other software such as FastTree.

However, ClustalO produces a slightly different output. The output from ClustalO does not work in FastTree and I get the following error:

No sequence in phylip line TCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCAC

As you can see from the two files, there are some minor differences which could be causing the problem.

ClustalO:

 MT084071.1--------------------------------------------------

 TCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCAC
  

ClustalX

      MT084071.1 ---------- ---------- ---------- ---------- ---------- 

      TCTTGTAGAT CTGTTCTCTA AACGAACTTT AAAATCTGTG TGGCTGTCAC
  

Is this a bug, or is there a way to get ClustalO to produce 'correctly' formatted .phylip files, as ClustalX does?

Many Thanks

modified 6 months ago by h.mon • written 6 months ago by NickJD

Can you double check your post? I'm not convinced your formatting is correct and representative of the actual files.

All versions of Clustal should produce compatible Phylip files AFAIK.

written 6 months ago by Joe

Hi, I have checked the input and the output. I know it seems silly to think the output would be different, but it is.

The output snippets I uploaded are from the final line of the seq IDs. As you can see, the consistent difference is that there is no gap between the ID and the '---' in ClustalO, and that ClustalX has spacing within the sequence lines.

[Image: ClustalO output]

[Image: ClustalX output]

I know it seems crazy but the outputs are different.

modified 6 months ago by Joe • written 6 months ago by NickJD

I've edited your post to fix the images, please double check I got them the right way around.

Based on those, ClustalO is not outputting a valid phylip. The spacing in the ClustalX version is correct. That said, I can't say that I've ever experienced an issue with ClustalO, and indeed it's the newer and recommended tool.

I can partially recreate this, as when I run clustalo, it produces the 'unbroken' sequences, but does respect the space between ID and sequence start (though this may be because my test IDs are shorter than yours).

Can you share what version of Clustal this pertains to for each?

modified 6 months ago • written 6 months ago by Joe

Yes you fixed it. Thank you.

I used the newest version listed for Linux at http://www.clustal.org/omega/ (1.2.4). I also used the version from apt install clustalo, which is also listed as 1.2.4.

I think the ID spacing is indeed down to length; I can shorten my IDs to fix that. It does seem to be the gaps in the sequence which are needed.

Forgot to add: ClustalX is the version from apt install, which is 2.1.

Thanks again.

modified 6 months ago • written 6 months ago by NickJD

I am sure it is a problem with how ClustalO formats its Phylip output. ClustalW produces the same correct output as ClustalX.

modified 6 months ago • written 6 months ago by NickJD

h.mon wrote:

ClustalO is implementing the "strict" phylip format, described on the Phylip documentation page, which states:

Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code.

As far as I know, most (or at least several) programs implement a "relaxed" phylip format. One of the most common liberties taken is that the "species name" (the sequence identifier) is not restricted to exactly ten characters, with a space separating the sequence identifier from the sequence data. It seems ClustalX also implements one of the many relaxed phylip formats.

The scikit-bio documentation has a good page on the phylip format.
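To make the strict/relaxed distinction concrete, here is a minimal Python sketch. It assumes the strict rule quoted above (the first ten characters are the name, the rest is data) and that a relaxed reader wants a single space after the name. The function names are made up for illustration and are not part of any Clustal or FastTree tooling. Note that MT084071.1 is exactly ten characters long, which would explain why ClustalO leaves no gap before the sequence.

```python
def split_strict_phylip_line(line, name_width=10):
    """Split a strict-PHYLIP line (first block) into (name, sequence).

    Assumes the name occupies exactly the first `name_width` columns,
    blank-padded if shorter, and that the data follows immediately.
    """
    line = line.rstrip("\n")
    name = line[:name_width].strip()
    # Drop any blanks inside the data (e.g. ClustalX-style groups of ten).
    sequence = "".join(line[name_width:].split())
    return name, sequence


def to_relaxed_phylip_line(name, sequence):
    """Join name and sequence with one space, relaxed-PHYLIP style."""
    if any(ch.isspace() for ch in name):
        raise ValueError("relaxed PHYLIP names cannot contain whitespace")
    return f"{name} {sequence}"


# A line shaped like the ClustalO output in the question:
# a ten-character ID followed directly by the alignment columns.
strict_line = "MT084071.1" + "-" * 50
name, seq = split_strict_phylip_line(strict_line)
print(to_relaxed_phylip_line(name, seq))
```

This only handles the lines that carry a name; in an interleaved file the later blocks hold sequence data alone and would need to be appended to the matching record.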

written 6 months ago by h.mon

Thank you for this.

I found that if I get ClustalO to output the MSA in FASTA format, FastTree and other software understand it, and I can still get the trees to build.
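For anyone who still needs a phylip file from that FASTA alignment, a rough Python sketch of the conversion follows. It assumes a sequential, relaxed layout (a header line with the sequence count and alignment length, then one "name sequence" line per record). The function names are hypothetical, and I have not confirmed which relaxed variants each downstream program accepts.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (identifier, sequence)."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the identifier.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fasta_to_relaxed_phylip(text):
    """Convert an aligned FASTA string to sequential relaxed PHYLIP."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; not an alignment")
    # Pad names to a common width, then one space before the data.
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name.ljust(width)} {seq}")
    return "\n".join(lines) + "\n"


example = ">MT084071.1 toy record\nAC-T\nGG\n>s2\nACGTGG\n"
print(fasta_to_relaxed_phylip(example), end="")
```

Identifiers are kept at full length here, so programs that insist on the strict ten-character field would still need the names shortened first.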

What use are standards if we do not keep to them?

written 6 months ago by NickJD

Phylip was written a long time ago (around 1986 if I am correct). There were not many sequences around at that time, so 10 characters was perhaps deemed a reasonable field length. In the absence of WYSIWYG editors, there were requirements for blanks in fields, etc.

The FASTA format (not to be confused with the aligner of the same name), the most widely used in bioinformatics, has no formal definition. But it continues to be used to this day.

written 6 months ago by GenoMax

What use are standards if we do not keep to them?

Amen.

written 6 months ago by Joe