Hi, all. Thank you for reading this question.
I have some special sequences in proteins, shown below.
All of the proteins contain one similar segment, repeated a different number of times.
Take 'seq1' as an example: it has a sequence (not a motif in the biological sense), 'EETELICLSDVTATSGAMEHSEVILKEREE', repeated 5 times with some variation between copies. The number after each colon is the position of that segment in the protein.
seq1 EETELICLSDVTATSGAMEHSEVILKEREE:420# EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICVNDVTSTSRTMGHSSVVLKENEE:527# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580# EETELICLNDVTSTSEVAETPEDVLEGIEL:633
seq2 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580
seq3 EETELICLNDFTSTSRAMEHSEVILKEREE:421# EETELICVNDVTSTSHAMEHSAVILKENEE:527# EETKLICLNEVTSTSRAMEHSAVVIEDKAE:580
seq4 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580
I want to compute a similarity score between any two sequences based on the segment information above. How can I do it? I want to consider two factors: the similarity of the motifs and the number of repeats.
Previously, I joined the segments together as one sequence and used multiple alignment software to get a similarity score, but I do not know whether that is reasonable.
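One way to combine the two factors, as a rough sketch only: score each pair of proteins by how well their segments match each other, then scale that by the ratio of their repeat counts. Everything about the scoring here is an assumption made for illustration, not an established method. The function names (parse_record, identity, best_match_mean, similarity) are hypothetical, segments are compared position by position without gaps (they are all the same length in this data), and plain percent identity is used where a substitution matrix such as BLOSUM62 might be more appropriate.

```python
from itertools import combinations

# Records copied from the question: "name SEGMENT:position# SEGMENT:position ..."
RAW = """\
seq1 EETELICLSDVTATSGAMEHSEVILKEREE:420# EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICVNDVTSTSRTMGHSSVVLKENEE:527# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580# EETELICLNDVTSTSEVAETPEDVLEGIEL:633
seq2 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580
seq3 EETELICLNDFTSTSRAMEHSEVILKEREE:421# EETELICVNDVTSTSHAMEHSAVILKENEE:527# EETKLICLNEVTSTSRAMEHSAVVIEDKAE:580
seq4 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580"""


def parse_record(line):
    """Split one record into (name, [(segment, position), ...])."""
    name, rest = line.split(maxsplit=1)
    segments = []
    for item in rest.split("#"):
        seq, pos = item.strip().split(":")
        segments.append((seq, int(pos)))
    return name, segments


def identity(a, b):
    """Fraction of identical residues, compared position by position (no gaps)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def best_match_mean(src, dst):
    """Average, over segments in src, of the best identity to any segment in dst."""
    return sum(max(identity(s, d) for d in dst) for s in src) / len(src)


def similarity(segs_a, segs_b):
    """Assumed score: symmetric segment similarity times the repeat-count ratio."""
    a = [seq for seq, _ in segs_a]
    b = [seq for seq, _ in segs_b]
    seg_sim = (best_match_mean(a, b) + best_match_mean(b, a)) / 2
    count_ratio = min(len(a), len(b)) / max(len(a), len(b))
    return seg_sim * count_ratio


proteins = dict(parse_record(line) for line in RAW.splitlines())

# Print the score for every pair of proteins.
for p, q in combinations(proteins, 2):
    print(p, q, round(similarity(proteins[p], proteins[q]), 3))
```

With this scoring, two proteins reach 1.0 only if they carry the same segments in the same number of copies. Multiplying by the count ratio penalizes a difference in repeat number quite heavily, so a weighted sum of the two factors may suit the question better. The positions are parsed but not used in the score.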
Thanks, both ideas are great. I did the first analysis before. It gives the evolutionary history of the different motifs, which is very interesting. About the second idea: if I substitute '-' for the motif parts, will it give the same result as removing the motifs?
That cannot be predicted. You'll have to perform the analysis and compare results. That does not seem overly difficult with 5 sequences, plus perhaps an outlier, if it exists, that naturally does not contain the motif.
Thanks. I have compared them. There is not much difference, at least for this example.