Question

How to group some nucleotide sequences based on similarity.

0

4.3 years ago

arriyaz.nstu ▴ 30

I have 20 nucleotide sequences of a particular gene, collected from 20 different strains of a virus. My goal is to separate them into groups based on similarity and then derive a consensus sequence for each group. How can I do this?

sequence alignment • 1.8k views

ADD COMMENT • link updated 4.3 years ago by yairgatt ▴ 10 • written 4.3 years ago by arriyaz.nstu ▴ 30

0

Do a multiple sequence alignment. You can use a local program such as MEGA or Clustal, or an online web interface, e.g. Clustal Omega.

ADD REPLY • link 4.3 years ago by GenoMax 141k

0

As far as I know, an MSA will give me only one consensus sequence for all 20 nucleotide sequences. But I want to group the sequences first (maybe 3 or 4 groups), so that the most similar sequences are put together, and then get one consensus sequence for each group. I'm very new to bioinformatics, so maybe I am wrong.

ADD REPLY • link 4.3 years ago by arriyaz.nstu ▴ 30

1

If you know which sequences are more similar to each other, you could separate them before doing individual MSAs.

If you don't have an idea, then try doing an initial MSA with all sequences (since you said they are from a particular gene, they should be similar enough for that alignment). Examine the results of the alignment (plot a distance tree) and then decide on the groups you want to break the sequences into before doing individual alignments to get a consensus for each group.
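To make the idea concrete, here is a minimal Python sketch of the "align, look at distances, split into groups, take a consensus per group" workflow. It assumes the sequences have already been aligned (equal length, with `-` for gaps). The function names, the toy sequences and the `max_dist` cutoff are all made up for illustration, and the grouping is plain single-linkage on the fraction of mismatched columns, which is a simplification and not what a tree-building program would do.

```python
from collections import Counter
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def group_by_distance(aligned, max_dist):
    """Single-linkage grouping: link sequences whose distance <= max_dist."""
    parent = {name: name for name in aligned}

    def find(n):
        # Follow parent links to the group representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(aligned, 2):
        if p_distance(aligned[a], aligned[b]) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for name in aligned:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


def consensus(seqs):
    """Most common character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Toy aligned sequences (hypothetical data, equal length as in an MSA).
aligned = {
    "strainA": "ATGCCGTA",
    "strainB": "ATGCCGTT",
    "strainC": "TTGAAGCA",
    "strainD": "TTGAAGCC",
}
groups = group_by_distance(aligned, max_dist=0.2)
for g in groups:
    print(g, consensus([aligned[n] for n in g]))
```

With real data you would read the aligned FASTA produced by your MSA program and pick the cutoff after looking at the tree. Note that with only two sequences in a group, a column where they differ has no majority, so the consensus character there is arbitrary.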

ADD REPLY • link 4.3 years ago by GenoMax 141k

score 1 · Answer 1 · 2019-12-23

1

4.3 years ago

yairgatt ▴ 10

I am not certain what question you are hoping to answer, but you could cluster the sequences using a program like CD-HIT. If you are trying to assess the evolutionary relationships between the strains, it might be best to construct a dendrogram or a phylogenetic tree for these sequences.
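If you go the CD-HIT route, the nucleotide program in the suite is `cd-hit-est`, and it writes a `.clstr` file listing the members of each cluster. Below is a rough Python sketch for turning that file into groups of sequence names, which you could then align separately to get a consensus per group. The `parse_clstr` function is hypothetical, the sample text is invented, and the line layout (`>Cluster N` headers followed by lines containing `>name...`) is assumed from typical CD-HIT output, so check it against your own file.

```python
import re
from collections import defaultdict


def parse_clstr(text):
    """Map cluster id -> list of sequence names from CD-HIT .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line such as ">Cluster 0" starts a new cluster.
            current = int(line.split()[1])
        elif line and current is not None:
            # Member line: the name sits between ">" and "...".
            m = re.search(r">(.+?)\.\.\.", line)
            if m:
                clusters[current].append(m.group(1))
    return dict(clusters)


# Invented example in the assumed .clstr layout.
example = """>Cluster 0
0\t1500nt, >strainA... *
1\t1498nt, >strainB... at +/98.60%
>Cluster 1
0\t1502nt, >strainC... *
"""
print(parse_clstr(example))
```

For a real run you would pass the contents of the `.clstr` file, e.g. `parse_clstr(open(path).read())`, where `path` is wherever your CD-HIT output was written.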

ADD COMMENT • link 4.3 years ago by yairgatt ▴ 10

0

Thank you for your suggestion. My main goal is to cluster the sequences based on similarity. I think CD-HIT will be a solution.

ADD REPLY • link 4.3 years ago by arriyaz.nstu ▴ 30

1

Keep in mind that CD-HIT clusters solely based on sequence similarity. It will not take evolutionary relationships between sequences into consideration (or introduce gaps where needed).

ADD REPLY • link 4.3 years ago by GenoMax 141k

0

Luckily, I only need to group them based on similarity. I am not going to do any evolutionary analysis. Thank you for your help.

ADD REPLY • link 4.3 years ago by arriyaz.nstu ▴ 30