How to convert fasta file format to phylip file format
2
1
3.4 years ago
Mike ★ 1.7k

Hi all,

I have FASTA sequences of some proteins, and I want to convert them to PHYLIP format to build a phylogenetic tree using ggtree. I tried the online EMBOSS seqret tool to convert the FASTA file to PHYLIP format, but I got an error when I read the result in ggtree.

My input sequences:

>proteinsA
RTDKKPALCKSYQKLVSEVWHKKRPSYVVP
>proteinsB
MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT
SHVSFP
>proteinsC
ITTEETMEEDKSQSDVDFQSCESCSNSDRAENENGSRCFSEDNNETTMLIQDDENN


and the EMBOSS seqret output is:

 3 116

proteinsA MGDSRDLCPH LDSIGEVTKE DLLLKSKGTC QSCGVTGPNL WACLQVACPY
proteinsB MTGSNSHITI LTLKVLPHFE SLGKQEKIPN KMSAFRNHCP HLDSVGEITK
proteinsC MGDSRDLCPH LDSIGEVTKE DLLLKSKGTC QSCGVTGPNL WACLQVACPY

EDLIQKSLGT SHVSFP---- ---------- ---------- ----------

---------- ------
---------- ------
EDNNETTMLI QDDENN


But I got an error when reading this PHYLIP file:

tree <- read.phylip("emboss_seqret_output.txt")


Error in read.phylip("emboss_seqret_output.txt") :
input file is not phylip tree format...


Thanks a lot.

phylip fasta ggtree Phylogenetic tree R • 10k views
1

I use different tools to build phylogenetic trees. But I also need to convert fasta to phylip.

To convert fasta to phylip: http://sequenceconversion.bugaco.com/converter/biology/sequences/fasta_to_phylip.php

A program for phylogenetic trees: http://www.atgc-montpellier.fr/phyml/

Other useful programs from that site: http://www.atgc-montpellier.fr/index.php?type=pg

1

Don’t convert fasta to phylip. That tool is steering you wrong. While it is possible to represent the 2 files in a visually similar manner you should not do this as a text manipulation. The input sequences should be fed to an alignment program.

1

I was able to load the PHYLIP file you posted here (the output from EMBOSS seqret) using the read.phylip function from phylotools. I tried both the .phy and .txt extensions and saw no difference. @Mike, I think you are using the read.phylip function from the treeio package. The only changes I made were to insert an extra space between the sequence IDs and the sequences, and to remove the blank line between the first line and the next.

> library("phylotools")
seq.name
1 proteinsA
2 proteinsB
3 proteinsC
seq.text
2 MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGTSHVSFP--------------------------------------------------
Warning message:
In readLines(infile) : incomplete final line found on 'file.phy'
seq.name
1 proteinsA
2 proteinsB
3 proteinsC
seq.text
2 MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGTSHVSFP--------------------------------------------------
Warning message:
In readLines(infile) : incomplete final line found on 'file.txt'

0

Thanks cpad0112. Yes, I can read the file with the read.phylip function from phylotools, but not with the one from ggtree/treeio. How can I build a tree from this PHYLIP file using phylotools?

0

I guess you have resolved the issue. For future reference, the tool needs sequential PHYLIP format, not an interleaved format. It also needs dendrogram information (maybe Nexus) at the end of the .phy file.
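
To make the sequential/interleaved distinction concrete, here is a minimal Python sketch that rewrites an interleaved file as a sequential one. It assumes the "relaxed" layout, where names contain no spaces and are separated from the sequence by whitespace; strict PHYLIP with fixed 10-character names would need different parsing. The function name and the toy data are made up for illustration.

```python
def interleaved_to_sequential(text):
    """Convert relaxed interleaved PHYLIP text to sequential PHYLIP text."""
    # Blank lines only separate blocks, so drop them.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())

    names, seqs = [], []
    # First block: every line is "name chunk chunk ...".
    for ln in lines[1:1 + ntax]:
        name, *chunks = ln.split()
        names.append(name)
        seqs.append("".join(chunks))

    # Later blocks: sequence chunks only, in the same taxon order.
    for i, ln in enumerate(lines[1 + ntax:]):
        seqs[i % ntax] += "".join(ln.split())

    # The header promises nchar columns for every taxon.
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} columns, got {len(seq)}")

    out = [f"{ntax} {nchar}"]
    out += [f"{name}  {seq}" for name, seq in zip(names, seqs)]
    return "\n".join(out) + "\n"


# Toy interleaved input (not the seqret output from this thread).
example = """3 16

seqA  MGDSRDLC PH
seqB  MTGSNSHI TI
seqC  MGDSR--- PH

LDSIGE
LTLKVL
LDSI--
"""

print(interleaved_to_sequential(example))
```

The length check is there because a block with missing lines would otherwise silently shift residues onto the wrong taxon.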

0

You may need to make the file extension “.phy”.

Also, I’m not sure if it’s just how you’ve copied and pasted, but there isn’t normally a blank line between the two numbers in the first line and the start of the alignment itself (at least as far as I have seen in the past, and PHYLIP is one of the stricter formats).

The bigger issue here is that you should not be “converting” a fasta to a PHYLIP. A phylip is an alignment file, not just a sequence representation. For your tree to be meaningful at all you need to align the sequences, using something like CLUSTAL or MUSCLE.
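
A quick way to see why the raw FASTA cannot simply be reformatted: a PHYLIP alignment needs every row to have the same number of columns, and the unaligned input does not. A minimal Python sketch, using the three sequences from the question (the helper names are just for illustration):

```python
# The unaligned input sequences from the original post.
sequences = {
    "proteinsA": "RTDKKPALCKSYQKLVSEVWHKKRPSYVVP",
    "proteinsB": "MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT"
                 "SHVSFP",
    "proteinsC": "ITTEETMEEDKSQSDVDFQSCESCSNSDRAENENGSRCFSEDNNETTMLIQDDENN",
}


def length_report(records):
    """Map each sequence name to its length."""
    return {name: len(seq) for name, seq in records.items()}


def is_aligned(records):
    """True only if there is at least one sequence and all lengths agree."""
    return len(set(length_report(records).values())) == 1


print(length_report(sequences))
print("usable as an alignment:", is_aligned(sequences))
```

Equal lengths are necessary but not sufficient: padding with gaps at the end makes the lengths match without making the columns homologous, which is why an aligner is needed.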

0

It is not a copied-and-pasted file; I downloaded it from the EMBOSS seqret result page, as per below...

I also have a MAFFT alignment file, but I don't know how to use it to generate a tree.

0

That confirms my suspicions about the spacing of the first and second lines in your pasted example.

You can try to fix it, but it's not the file you should be using. Can you paste what your MAFFT output looks like?

0
>proteinsA
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
------QKLVSEVWHKKRPSYVVP-----
>proteinsB
MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT
S--------------HVSFP----------------------------------------
-----------------------------
>proteinsC
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
AENENGSRCFSE--DNNETTMLIQDDENN

2

That’s an aligned fasta (though to my eye it looks to be a fairly poor alignment) - proceed with caution.

Most tree building software will be able to accept fasta as an input. Otherwise you have 2 options:

1. Go back to MAFFT and request phylip as the output format directly.
2. Convert the aligned fasta to a phylip.
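
For option 2, a minimal Python sketch of the conversion, assuming the FASTA is already aligned and writing the "relaxed" sequential layout (full-length names followed by whitespace), which many current tree builders accept; strict PHYLIP would truncate or pad names to 10 characters. The function name and the toy input are hypothetical.

```python
def fasta_to_phylip(fasta_text):
    """Convert aligned FASTA text to relaxed sequential PHYLIP text."""
    records = []  # [name, sequence] pairs, in input order
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the taxon name.
            records.append([line[1:].split()[0], ""])
        elif records:
            # Wrapped sequence lines are joined back together.
            records[-1][1] += line

    if not records:
        raise ValueError("no FASTA records found")

    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; align them first")

    nchar = lengths.pop()
    width = max(len(name) for name, _ in records) + 2
    lines = [f"{len(records)} {nchar}"]
    lines += [f"{name.ljust(width)}{seq}" for name, seq in records]
    return "\n".join(lines) + "\n"


# Toy aligned FASTA with wrapped lines (not the MAFFT output from this thread).
aligned = """>seqA
MGDSRDLC
PH--
>seqB
MTGSNSHI
TILT
"""

print(fasta_to_phylip(aligned))
```

The function refuses unaligned input on purpose, so it cannot be used to produce the kind of gap-padded file that seqret generated from the raw sequences.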

Additionally, ggtree is not a tree construction program; it is just for rendering/plotting precalculated trees. From their documentation, it apparently supports “phylip tree format”, which is not a format I’m familiar with, but it still requires a Newick representation of the tree in the PHYLIP file alongside the aligned sequences.

I would probably start over from your original fasta, align with MAFFT/Clustal/whatever, output directly as a phylip, then use something like IQTREE to actually calculate the tree itself.

Lastly I would just ask: is this a toy data set for our benefit or have you really only got 3 sequences?

0

Thanks jrj.healey for your help, I have around 150 protein sequences, this is just toy/example data.

0

ggtree expects a phylip file with the newick string. The file you have converted using Seqret does not have the newick string.

 read.phylip    parsing phylip file (phylip alignment + newick string)

0

Thanks Sej, that's exactly my problem: how do I generate a PHYLIP file (PHYLIP alignment + Newick string) for plotting in ggtree?

0

No need. You don’t need the phylip at all, you just need a newick formatted tree, which is the most common output for any phylogenetics tool.

Use a tool like IQTREE, and just take the treefile it gave you. You do not need to do anything else.
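
Before plotting, it can be worth checking that the tip labels in the Newick tree match your sequence names. A minimal Python sketch follows; the Newick string is a made-up example rather than real IQTREE output, and the regular expression assumes unquoted labels without spaces.

```python
import re


def newick_tip_labels(newick):
    """Return tip labels from a Newick string (unquoted labels only)."""
    # A tip label directly follows "(" or ","; internal node labels and
    # support values follow ")", so they are not picked up.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Hypothetical tree for the three toy sequences in this thread.
tree_text = "(proteinsA:0.1,(proteinsB:0.2,proteinsC:0.3)90:0.05);"
expected = {"proteinsA", "proteinsB", "proteinsC"}

tips = newick_tip_labels(tree_text)
print(tips)
print("labels match:", set(tips) == expected)
```

A mismatch here usually means the tree builder shortened or rewrote the names, which would also break any later attempt to attach per-sequence data to the tips.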

0

See if this is what you want, @Mike:

input:

$ cat test.fa
>proteinsA
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT
------QKLVSEVWHKKRPSYVVP-----
>proteinsB
MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGT
S--------------HVSFP----------------------------------------
-----------------------------
>proteinsC
-------------------------------MGDSRDLCPHLDSIGEVTKEDLLLKSKGT


output:

#NEXUS
begin data;
dimensions ntax=3 nchar=149;
format datatype=protein missing=? gap=-;
matrix
proteinsB MTGSNSHITILTLKVLPHFESLGKQEKIPNKMSAFRNHCPHLDSVGEITKEDLIQKSLGTS--------------HVSFP---------------------------------------------------------------------
;
end;


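import sys

from Bio import AlignIO
# Note: Bio.Alphabet was removed in Biopython 1.78, so this script needs an older release.
from Bio.Alphabet import IUPAC, Gapped

input_file = sys.argv[1]   # aligned FASTA to read
output_file = sys.argv[2]  # NEXUS file to write

with open(output_file, "w") as o:
    with open(input_file, "r") as i:
        # Parse the input as a gapped protein alignment and write it as NEXUS.
        infa = AlignIO.parse(i, "fasta", alphabet=Gapped(IUPAC.protein))
        AlignIO.write(infa, o, "nexus")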

0

This doesn’t solve OP’s problem because it still contains no dendrogram information.

1
3.4 years ago
Mike ★ 1.7k

I found a nice tutorial on building phylogenetic trees from FASTA sequences:

http://www.cbs.dtu.dk/courses/biosys/binfintro/phylogeny.php

Step 1: Open the sequence file (fasta), select the entire file, and copy the sequences.

Step 2: Align the sequences using the MAFFT server at EBI with default settings.

Step 3: Open the TreeHugger web server. (The TreeHugger server constructs a neighbor joining tree from an aligned set of sequences).

Step 5: Visualizing using ggtree

library(ggtree)
ggtree(tree) + geom_tiplab()

2
3.4 years ago
Guangchuang Yu ★ 2.5k

ggtree supports the PHYLIP tree format, but not a PHYLIP multiple sequence alignment.

A PHYLIP tree file contains the MSA in the well-known PHYLIP format, with an additional record holding the corresponding tree as Newick text.

ggtree visualizes phylogenetic trees, so you need to have a tree before passing it to ggtree.

A PHYLIP sequence file contains only sequences, so you need to construct the tree before visualizing it.

I am the author of ggtree and recommend that you post ggtree questions to the Google group.