Question

Clustering based on sequence similarity

4.0 years ago

Мики-рики-тави • 0

I've got 5 from 7 to 30 thousand virus genome sequences per each strain, and I need to separate the sequences into groups based on their similarity. How can I do that? By the way, I'm able to align each strain with MAFFT, but I don't really know how to cluster them. I'd be really happy to hear the answer.

relativenessbasedclustering • 644 views

updated 4.0 years ago by Mensur Dlakic ★ 29k • written 4.0 years ago by Мики-рики-тави • 0

I realize it is your username, but I feel like it should be riki-Miki-tavi.

4.0 years ago by Mensur Dlakic ★ 29k

score 0 · Answer 1 · 2021-07-06

This part is not clear:

"I've got 5 from 7 to 30 thousand virus genome sequences"

You have 5 protein sequences from 30K genomes? 5-7 protein sequences? 5-7 genes?

If you are talking about whole genome clustering, that would not be easy on such a scale. I recommend that you use predicted proteins for each of them. Then:

1. align them individually
2. trim the alignments
3. concatenate those alignments into a super-matrix
4. make a phylogenetic tree
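The concatenation step is the one most often done by hand, so here is a minimal sketch of it in plain Python. It assumes each protein alignment is already aligned and trimmed, is available as FASTA text, and uses the same genome identifier as the first word of each header. The function names `parse_fasta` and `concatenate_alignments` are made up for illustration, and genomes missing from a given alignment are padded with gaps, which is one possible convention rather than the only one.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence} (identifier = first word of header)."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def concatenate_alignments(alignments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Join per-protein alignments column-wise into one super-matrix.

    Assumption: a genome absent from an alignment gets a run of gap
    characters as wide as that alignment.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    supermatrix: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}


# Toy example with two tiny "protein" alignments and three genomes.
protein_a = parse_fasta(">g1\nMK-L\n>g2\nMKAL\n")
protein_b = parse_fasta(">g1\nQQ\n>g3\nQR\n")
matrix = concatenate_alignments([protein_a, protein_b])
for genome, row in matrix.items():
    print(genome, row)
```

Every row of the result has the same length, which is what tree-building programs expect from a super-matrix.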

Beware that each of these steps, especially the last one, will take a long time. Also, there is a large potential for error when working on this scale, even for those who have already done all these steps before. Even if all of this works, it is very difficult to look through a tree that has 30K nodes. Lastly, most of your genomes will be (near-)identical at the protein level, so you still may not get much useful information.
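Since many of the sequences are expected to be identical, one way to shrink the problem before aligning or building a tree is to collapse exact duplicates first. The sketch below is only an illustration of that idea: the function name `dereplicate` is hypothetical, it handles exact matches only (after dropping gaps and ignoring case), and near-identical sequences would need a similarity threshold and a dedicated clustering tool, which this does not attempt.

```python
from collections import defaultdict
from typing import Dict, List


def dereplicate(sequences: Dict[str, str]) -> Dict[str, List[str]]:
    """Group identifiers that share exactly the same sequence.

    Returns {representative_id: [all ids with that sequence]}, where the
    representative is simply the first identifier seen for each sequence.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, seq in sequences.items():
        key = seq.replace("-", "").upper()  # ignore gaps and letter case
        groups[key].append(name)
    return {names[0]: names for names in groups.values()}


# Toy example: g1 and g3 are identical once gaps and case are ignored.
toy = {"g1": "MKAL", "g2": "MKVL", "g3": "mk-al"}
clusters = dereplicate(toy)
print(clusters)
```

Only the representatives then need to go through alignment and tree building; the member lists can be used afterwards to map the results back to every genome.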