Question

How many sequences should I use for a phylogeny?

0

6.7 years ago

l.souza ▴ 80

Hello, everyone!

I want to build a phylogeny of the serotypes of a virus. I retrieved the sequences from NCBI, and this is what I have:

S1: 200 sequences
S2: 87 sequences
S3: 549 sequences
S4: 8 sequences
S5: 17 sequences

Should I align all these sequences to create a tree, or should I choose one sequence from each serotype?

Moreover, I chose another virus of the same family to be the outgroup. Should I align it together with the other sequences?

phylogeny sequences outgroup • 2.5k views

ADD COMMENT • link updated 6.7 years ago by Joe 21k • written 6.7 years ago by l.souza ▴ 80

1

6.7 years ago

lessismore ★ 1.3k

You almost certainly have many redundant sequences. It really depends on your experimental design and on what you want to show. I would heavily filter S1, S2 and S3 down to an equal number of sequences each (you decide on the criteria) and leave S4 and S5 as they are. About 150 sequences is already plenty for a phylogenetic tree if you want to show names and bootstrap values in your figure. I suggest trying http://etetoolkit.org/ and https://itol.embl.de/. Both are very user-friendly tools.
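As a rough illustration of the filtering idea, here is a minimal Python sketch that caps the large serotypes and keeps the small ones whole. Random sampling, the cap of 40, and the placeholder IDs are all assumptions made for the example; a real filter might instead pick by sequence identity, collection date, or geography.

```python
import random


def subsample_by_serotype(groups, cap, seed=0):
    """Randomly keep at most `cap` sequence IDs per serotype.

    groups: dict mapping serotype label -> list of sequence IDs.
    Serotypes with `cap` or fewer sequences are kept whole.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    picked = {}
    for serotype, ids in groups.items():
        if len(ids) <= cap:
            picked[serotype] = list(ids)
        else:
            picked[serotype] = rng.sample(ids, cap)
    return picked


# Placeholder IDs standing in for the real NCBI accessions (hypothetical).
counts = {"S1": 200, "S2": 87, "S3": 549, "S4": 8, "S5": 17}
groups = {s: [f"{s}_{i}" for i in range(n)] for s, n in counts.items()}

picked = subsample_by_serotype(groups, cap=40)  # cap=40 is an arbitrary choice
sizes = {s: len(ids) for s, ids in picked.items()}
print(sizes, sum(sizes.values()))
```

With these numbers, S1 to S3 are reduced to 40 each and S4 and S5 are untouched, for 145 sequences in total.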

ADD COMMENT • link 6.7 years ago by lessismore ★ 1.3k

score 1 · Accepted Answer · 2017-07-28

1

6.7 years ago

Joe 21k

If you have the computational resources, you can keep as many sequences as you like, though multiple sequence alignment scales (I think) non-linearly with the number of sequences, so you could shrink some of the clusters if you want.
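To give a feel for why the cost grows faster than the dataset, here is a small Python sketch that counts sequence pairs. It assumes the aligner compares every pair of sequences at some stage, which is not true of every tool or mode, and the reduced size of 150 is an arbitrary comparison point.

```python
def n_pairs(n):
    """Number of distinct sequence pairs among n sequences."""
    return n * (n - 1) // 2


counts = {"S1": 200, "S2": 87, "S3": 549, "S4": 8, "S5": 17}
total = sum(counts.values())  # all sequences from the question

full_pairs = n_pairs(total)
reduced_pairs = n_pairs(150)  # 150 is an arbitrary reduced dataset size

print(total, full_pairs)      # 861 370230
print(150, reduced_pairs)     # 150 11175
```

Under that assumption, cutting the dataset to roughly a sixth of its size cuts the number of pairwise comparisons by a factor of about 33.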

The only other consideration is that when you actually plot the tree, having too many sequences will make it unreadable.

Create a multiple sequence alignment of all the sequences you're interested in, including your outgroup. You can then make your tree with whatever tool you like. RAxML and PhyML are popular choices.
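In practice this just means the outgroup goes into the same unaligned FASTA file as everything else before alignment. A minimal Python sketch of that merging step, using a hand-rolled FASTA parser and made-up record names and sequences (all hypothetical):

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def combine_for_alignment(ingroup_text, outgroup_text):
    """Merge ingroup and outgroup records into one unaligned FASTA string."""
    records = parse_fasta(ingroup_text) + parse_fasta(outgroup_text)
    ids = [header.split()[0] for header, _ in records]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate sequence IDs; rename before aligning")
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Toy records; real input would be the downloaded serotype and outgroup files.
ingroup = ">S1_a\nATGC\nGGTA\n>S4_a\nATGCGGTT\n"
outgroup = ">OUT_1 related virus\nATGAGGTA\n"

combined = combine_for_alignment(ingroup, outgroup)
print(combined)
```

The combined file is what you would hand to the aligner, and the outgroup record is then used to root the resulting tree.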

ADD COMMENT • link 6.7 years ago by Joe 21k

0

Would you say that using the same number of sequences for each serotype is important?

ADD REPLY • link 6.7 years ago by l.souza ▴ 80

1

Not necessarily. It all depends on the complexity within a given serotype. If the serotype is very conserved for the genes you're looking at, then you can get away with fewer within any given predicted clade, since it's more or less certain they'll all group together.

Your tree might look better with a roughly equal number between serotypes, but it's easy to collapse nodes and clusters after the fact and do all the aesthetic tweaks in whatever tool you use to draw the tree. It's much more difficult to go back and add data in.

The main point is that you need sufficient numbers of sequences within any cluster you expect to see to be confident that it's a real cluster.

I would perhaps aim for ~15 sequences per clade from your dataset. If the sequences themselves aren't very long, they won't take too long to align. You will want to look at the sequence diversity within any given clade first, though. You'll need fewer sequences in a conserved clade to truly represent its diversity, so you can perhaps scale the number of sequences you use accordingly.
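One way to put a number on "diversity within a clade" is the mean pairwise p-distance over already-aligned sequences. The Python sketch below does that on toy data. The `suggested_n` rule (linear scaling of the ~15 target, with a floor of 5) is purely a hypothetical illustration, not an established method, and the sequences are made up.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences, skipping gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def mean_diversity(seqs):
    """Mean pairwise p-distance within one clade (0.0 for fewer than two sequences)."""
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists) if dists else 0.0


def suggested_n(diversity, max_diversity, target=15, floor=5):
    """Hypothetical rule: scale the ~15 target linearly with relative diversity."""
    if max_diversity == 0:
        return floor
    return max(floor, round(target * diversity / max_diversity))


# Toy aligned sequences for two made-up clades.
conserved = ["ATGCATGC", "ATGCATGC", "ATGCATGA"]
variable = ["ATGCATGC", "TTGCAAGC", "ATCCATGA"]

d_conserved = mean_diversity(conserved)
d_variable = mean_diversity(variable)
d_max = max(d_conserved, d_variable)

n_conserved = suggested_n(d_conserved, d_max)
n_variable = suggested_n(d_variable, d_max)
print(d_conserved, d_variable, n_conserved, n_variable)
```

Here the conserved clade ends up at the floor of 5 sequences, while the most diverse clade gets the full 15.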

ADD REPLY • link 6.7 years ago by Joe 21k