Clustering based on sequence similarity
3 months ago

I've got 5 from 7 to 30 thousand virus genome sequences per each strain, and I need to separate the sequences into groups based on their similarity. How can I do that? I'm able to align each strain with MAFFT, but I don't really know how to cluster the sequences. I'd be really happy to hear an answer.

relativenessbasedclustering • 162 views

I realize it is your username, but I feel like it should be riki-Miki-tavi.

3 months ago
Mensur Dlakic ★ 14k

This part is not clear:

I've got 5 from 7 to 30 thousand virus genome sequences

You have 5 protein sequences from 30K genomes? 5-7 protein sequences? 5-7 genes?

If you are talking about whole genome clustering, that would not be easy on such a scale. I recommend that you use predicted proteins for each of them. Then:

  • align them individually
  • trim the alignments
  • concatenate those alignments into a super-matrix
  • make a phylogenetic tree
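
The concatenation step can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming each protein has already been aligned and trimmed into its own FASTA text, and that the first word of each header is the genome ID. The function names `parse_fasta` and `concatenate_alignments` are hypothetical helpers, not part of MAFFT or any other tool, and padding absent genomes with gaps is one possible choice rather than the only one.

```python
def parse_fasta(text):
    """Parse FASTA text into a {genome_id: sequence} mapping."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first word of the header is the genome ID.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def concatenate_alignments(alignments, gap="-"):
    """Join per-protein alignments column-wise into one super-matrix.

    alignments: list of {genome_id: aligned_sequence} dicts, one per protein.
    Genomes missing from an alignment are padded with gap characters.
    """
    genome_ids = sorted({g for aln in alignments for g in aln})
    supermatrix = {g: [] for g in genome_ids}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for g in genome_ids:
            supermatrix[g].append(aln.get(g, gap * width))
    return {g: "".join(parts) for g, parts in supermatrix.items()}


# Toy example: two tiny protein alignments over three genomes.
aln_a = parse_fasta(">g1\nMK-L\n>g2\nMKAL\n")
aln_b = parse_fasta(">g1\nQQ\n>g3\nQE\n")
matrix = concatenate_alignments([aln_a, aln_b])
for genome, seq in matrix.items():
    print(genome, seq)
```

Every row of the resulting matrix has the same length, which is what tree-building programs generally expect as input. With tens of thousands of genomes, a genome that lacks many of the proteins ends up mostly gaps, so it may be worth filtering those out first.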

Beware that each of these steps, especially the last one, will take a long time. Also, there is a large potential for error when working on this scale, even for those who have already done all these steps before. Even if all of this works, it is very difficult to look through a tree that has 30K nodes. Lastly, most of your genomes will be (near-)identical at a protein level, so you still may not get much useful information.
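
A quick way to see how much redundancy there is before investing in the full pipeline is to collapse exactly identical sequences. The sketch below assumes the sequences are already loaded into a dictionary keyed by ID; `collapse_identical` is a hypothetical helper name. It only catches exact duplicates, so near-identical sequences would still need a proper similarity-based clustering step.

```python
from collections import defaultdict


def collapse_identical(sequences):
    """Group sequence IDs that share exactly the same sequence.

    sequences: {sequence_id: sequence} mapping.
    Comparison is case-insensitive; returns {sequence: [ids...]}.
    """
    groups = defaultdict(list)
    for seq_id, seq in sequences.items():
        groups[seq.upper()].append(seq_id)
    return dict(groups)


# Toy example: s1 and s2 differ only in letter case, s3 differs by one residue.
seqs = {"s1": "ATGC", "s2": "atgc", "s3": "ATGA"}
groups = collapse_identical(seqs)
print(len(seqs), "sequences collapse into", len(groups), "unique sequences")
```

If the number of unique sequences is a small fraction of the total, keeping one representative per group makes every later step (alignment, trimming, tree building) considerably cheaper.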

