Question: How many sequences for a phylogeny?
l.souza (Brasilia, Brazil) wrote 2.1 years ago:

Hello, everyone!

I want to build a phylogeny of the serotypes of a virus. I retrieved the sequences from NCBI, and this is what I have:

S1: 200 sequences
S2: 87 sequences
S3: 549 sequences
S4: 8 sequences
S5: 17 sequences

Should I align all these sequences to create a tree, or should I choose one from each serotype?

Moreover, I chose another virus of the same family to be the outgroup. Should I align it together with the other sequences?

jrj.healey (United Kingdom) wrote 2.1 years ago:

If you have the computational resources, you can keep as many sequences as you like, though the cost of a multiple sequence alignment scales (I think) non-linearly with the number of sequences, so you could shrink some of the clusters if you want.
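To get a feel for that non-linear scaling, here is a rough sketch that counts the sequence pairs an aligner would have to compare if it computes a distance for every pair, which is an assumption and not true of every tool. The cap of 50 sequences per serotype is an arbitrary illustrative value, and real run time also depends on sequence length and the algorithm used.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Sequence counts per serotype, as given in the question.
counts = {"S1": 200, "S2": 87, "S3": 549, "S4": 8, "S5": 17}

# Hypothetical cap, chosen only to illustrate the effect of shrinking clusters.
cap = 50

total = sum(counts.values())
capped_total = sum(min(n, cap) for n in counts.values())

print(total, pairwise_comparisons(total))                # 861 370230
print(capped_total, pairwise_comparisons(capped_total))  # 175 15225
```

Cutting the dataset to about a fifth of its size reduces the number of pairs by a factor of roughly 24, which is why trimming the largest clusters pays off quickly.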

The only other consideration is that when you actually plot the tree, having too many sequences will make it unreadable.

Create a multiple sequence alignment of all the sequences you're interested in, including your outgroup. You can then make your tree with whatever tool you like. RAxML or PhyML are popular tools.
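As a minimal sketch of that first step, the snippet below merges the serotype sequences and the outgroup into one FASTA string, which you would then write to a file and pass to the aligner of your choice. The helper names, the `S1|` style label prefix and the `OUTGROUP|` tag are all hypothetical conventions chosen for illustration; they just make the outgroup easy to find when you root the tree later.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def combine_for_alignment(ingroups, outgroup):
    """Merge serotype records and outgroup records into one FASTA string.

    ingroups: dict mapping a serotype label to a list of (header, seq).
    outgroup: list of (header, seq) for the outgroup virus.
    """
    lines = []
    for label, records in ingroups.items():
        for header, seq in records:
            lines.append(f">{label}|{header}")
            lines.append(seq)
    for header, seq in outgroup:
        lines.append(f">OUTGROUP|{header}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy input standing in for the real NCBI downloads.
s1 = parse_fasta(">seqA\nACGT\nACGT\n>seqB\nACGA\n")
out = parse_fasta(">relative1\nTTGT\n")
combined = combine_for_alignment({"S1": s1}, out)
print(combined)
```

The point is simply that the outgroup goes into the same alignment as everything else, so that all sequences share one set of columns.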


Would you say that using the same number of sequences from each serotype is important?

Comment by l.souza, 2.1 years ago

Not necessarily. It all depends on the complexity within a given serotype. If the serotype is very conserved for the genes you're looking at, then you can get away with fewer within any given predicted clade, since it's more or less certain they'll all group together.

Your tree might look better with a roughly equal number between serotypes, but it's easy to collapse nodes and clusters after the fact and do all the aesthetic tweaks in whatever tool you use to draw the tree. It's much more difficult to go back and add data in.

The main point is that you need sufficient numbers of sequences within any cluster you expect to see to be confident that it's a real cluster.

I would perhaps aim for ~15 sequences per clade from your dataset. If the sequences themselves aren't very long, they won't take too long to align. You will want to look at the sequence diversity within each clade first, though: you'll need fewer sequences in a conserved clade to truly represent its diversity, so you can scale the number of sequences you use accordingly.
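A minimal sketch of that kind of subsampling, assuming you have a list of sequence IDs per serotype. The function name, the placeholder IDs and the fixed seed are hypothetical. It picks at random, so it does not take within-clade diversity into account; you would adjust `per_clade` by hand for conserved or diverse clades.

```python
import random


def subsample(ids_by_serotype, per_clade=15, seed=0):
    """Pick up to `per_clade` sequence IDs at random from each serotype.

    Serotypes with fewer sequences than the target are kept whole.
    A fixed seed makes the selection reproducible.
    """
    rng = random.Random(seed)
    chosen = {}
    for serotype, ids in ids_by_serotype.items():
        if len(ids) <= per_clade:
            chosen[serotype] = list(ids)
        else:
            chosen[serotype] = rng.sample(ids, per_clade)
    return chosen


# Placeholder IDs matching the counts in the question.
counts = {"S1": 200, "S2": 87, "S3": 549, "S4": 8, "S5": 17}
ids_by_serotype = {s: [f"{s}_{i}" for i in range(n)] for s, n in counts.items()}

picked = subsample(ids_by_serotype)
print({s: len(v) for s, v in picked.items()})
# {'S1': 15, 'S2': 15, 'S3': 15, 'S4': 8, 'S5': 15}
```

S4 only has 8 sequences, so all of them are kept.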

Reply by jrj.healey, 2.1 years ago
lessismore (Mexico) wrote 2.1 years ago:

You almost certainly have many redundant sequences. It really depends on your experimental design and what you want to show. I would take an equal number from S1, S2 and S3, filtering them heavily (you decide on the criteria), and I would leave S4 and S5 as they are. About 150 sequences is already plenty for a phylogenetic tree if you want to show names and bootstrap values in your figure. I suggest trying http://etetoolkit.org/ and https://itol.embl.de/, which are very user-friendly tools.
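One simple filtering criterion, shown here only as an assumption about what "redundant" could mean, is to collapse sequences that are exactly identical. The function name is hypothetical, and anything looser (for example, merging sequences above some similarity threshold) would need a dedicated clustering tool instead.

```python
def collapse_identical(records):
    """Keep one record per distinct sequence.

    records: list of (name, sequence) tuples.
    Returns a list of (kept_name, sequence, merged_names), where
    merged_names are the other records that had the same sequence.
    Comparison is case-insensitive; sequences are returned in upper case.
    """
    names_by_seq = {}
    for name, seq in records:
        names_by_seq.setdefault(seq.upper(), []).append(name)
    # Dicts preserve insertion order, so the first-seen name is the one kept.
    return [(names[0], seq, names[1:]) for seq, names in names_by_seq.items()]


records = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
print(collapse_identical(records))
# [('a', 'ACGT', ['b']), ('c', 'ACGA', [])]
```

Keeping the list of merged names means you can still report how many isolates each tip on the tree stands for.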
