Clustering based on sequence similarity
2.8 years ago

I've got 5 from 7 to 30 thousand virus genome sequences per each strain, and I need to separate the sequences into groups based on their similarity. How can I do that? By the way, I'm able to align each strain with MAFFT, but I don't really know how to cluster them. I'd be really happy to hear an answer.


I realize it is your username, but I feel like it should be riki-Miki-tavi.

2.8 years ago
Mensur Dlakic ★ 27k

This part is not clear:

I've got 5 from 7 to 30 thousand virus genome sequences

You have 5 protein sequences from 30K genomes? 5-7 protein sequences? 5-7 genes?

If you are talking about whole-genome clustering, that would not be easy on such a scale. I recommend that you use the predicted proteins from each genome instead. Then:

  • align them individually
  • trim the alignments
  • concatenate those alignments into a super-matrix
  • make a phylogenetic tree
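
To make the trimming and super-matrix steps above more concrete, here is a minimal sketch in plain Python. It assumes each protein has already been aligned (for example with MAFFT) and is available as FASTA text with the genome ID as the first word of each header. The function names, the 50% gap threshold, and the toy sequences are all made up for illustration; a dedicated trimming tool would normally do this job on real data.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into an {id: sequence} mapping."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop columns whose fraction of gaps exceeds max_gap_fraction (assumed threshold)."""
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(alignment[n][i] == "-" for n in names) / len(names) <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def concatenate(alignments):
    """Join per-protein alignments into one super-matrix.

    Genomes missing from a given alignment are padded with gaps.
    """
    taxa = sorted({name for aln in alignments for name in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        width = len(next(iter(aln.values()))) if aln else 0
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example: two tiny protein alignments, the second one lacks genome g2.
protein_a = parse_fasta(">g1\nMK-L\n>g2\nMR-L\n>g3\nMKAL\n")
protein_b = parse_fasta(">g1\nGT\n>g3\nGS\n")
supermatrix = concatenate(
    [trim_gappy_columns(protein_a), trim_gappy_columns(protein_b)]
)
for genome_id, row in supermatrix.items():
    print(f">{genome_id}\n{row}")
```

The resulting super-matrix can be written out as FASTA and passed to whichever tree-building program you prefer.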

Beware that each of these steps, especially the last one, will take a long time. Also, there is a large potential for error when working on this scale, even for those who have done all these steps before. Even if all of this works, it is very difficult to look through a tree that has 30K leaves. Lastly, most of your genomes will be (near-)identical at the protein level, so you still may not get much useful information.
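
One way to gauge how redundant the data are before building anything is to collapse exact duplicates and count what is left. The sketch below does only that; the function name and the toy sequences are hypothetical, and it does not handle near-identical sequences, which would need a proper clustering tool with an identity threshold.

```python
from collections import defaultdict


def collapse_identical(sequences):
    """Group sequence IDs that share exactly the same residues.

    Gaps are removed and case is ignored before comparing. The first ID
    seen for each distinct sequence is used as the group representative.
    """
    groups = defaultdict(list)
    for name, seq in sequences.items():
        key = seq.replace("-", "").upper()
        groups[key].append(name)
    return {members[0]: members for members in groups.values()}


# Toy example: g1, g2 and g4 are the same protein once gaps and case are ignored.
toy = {"g1": "MKLGT", "g2": "mklgt", "g3": "MKLGS", "g4": "MK-LGT"}
clusters = collapse_identical(toy)
print(f"{len(toy)} sequences collapse to {len(clusters)} distinct ones")
```

If only a small fraction of the sequences remain after this step, building the tree from the representatives alone should be much faster and easier to read.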
