Hi, all. Thank you for reading this question.
I have some special sequences in proteins, shown below.
Every protein contains a similar segment, repeated a different number of times.
Take 'seq1' as an example: it contains a segment (not a motif in the biological sense) similar to 'EETELICLSDVTATSGAMEHSEVILKEREE', repeated 5 times with small variations. The number after each colon is the position of that segment in the protein, and '#' separates the repeats.
seq1 EETELICLSDVTATSGAMEHSEVILKEREE:420# EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICVNDVTSTSRTMGHSSVVLKENEE:527# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580# EETELICLNDVTSTSEVAETPEDVLEGIEL:633
seq2 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580
seq3 EETELICLNDFTSTSRAMEHSEVILKEREE:421# EETELICVNDVTSTSHAMEHSAVILKENEE:527# EETKLICLNEVTSTSRAMEHSAVVIEDKAE:580
seq4 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580
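To make the format concrete, here is a minimal Python sketch of how one such line could be read into a name plus a list of (segment, position) pairs. The function name `parse_record` is just an illustrative choice, and it assumes every line follows exactly the layout above (name, whitespace, then `SEGMENT:POSITION` items separated by `#`).

```python
def parse_record(line):
    """Split 'name SEG:POS# SEG:POS ...' into (name, [(segment, position), ...])."""
    # The protein name is everything before the first whitespace.
    name, rest = line.split(maxsplit=1)
    repeats = []
    # Repeats are separated by '#'; each one is 'SEGMENT:POSITION'.
    for item in rest.split("#"):
        segment, position = item.strip().split(":")
        repeats.append((segment, int(position)))
    return name, repeats


line = "seq2 EETELICLNDVTSPLRAVEHSAVLLKDKVE:473# EETELICLNDVTSPSRAMEHSTVFIEEKEE:580"
name, repeats = parse_record(line)
print(name, len(repeats), repeats)
```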
I want to compute a similarity score between any two proteins based on the segment information above. How can I do it? I want to take two factors into account: the similarity of the segments and the number of repeats.
Previously, I joined the segments of each protein into one sequence and used multiple alignment software to get similarity scores between them, but I do not know whether this is reasonable.
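To illustrate the kind of score I mean, here is a rough Python sketch that combines the two factors. Everything in it is an assumption for illustration, not an established method: segment similarity is plain percent identity (which only works because all segments here have the same length; a substitution matrix such as BLOSUM62 with a real alignment could replace it), each segment is paired with its best match in the other protein, the repeat factor is the ratio of the smaller to the larger repeat count, and the two are mixed with an arbitrary `weight`. Positions are ignored.

```python
def identity(a, b):
    """Fraction of identical residues; assumes equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def segment_similarity(segs_a, segs_b):
    """Average best-match identity, computed in both directions and averaged."""
    def one_way(src, dst):
        return sum(max(identity(s, d) for d in dst) for s in src) / len(src)
    return (one_way(segs_a, segs_b) + one_way(segs_b, segs_a)) / 2


def repeat_similarity(segs_a, segs_b):
    """Ratio of repeat counts: 1.0 when equal, smaller as the counts diverge."""
    return min(len(segs_a), len(segs_b)) / max(len(segs_a), len(segs_b))


def combined_score(segs_a, segs_b, weight=0.5):
    """Weighted mix of the two factors; weight=0.5 is an arbitrary choice."""
    return (weight * segment_similarity(segs_a, segs_b)
            + (1 - weight) * repeat_similarity(segs_a, segs_b))


# Segments copied from the data above (positions dropped).
seq1 = ["EETELICLSDVTATSGAMEHSEVILKEREE", "EETELICLNDVTSPLRAVEHSAVLLKDKVE",
        "EETELICVNDVTSTSRTMGHSSVVLKENEE", "EETELICLNDVTSPSRAMEHSTVFIEEKEE",
        "EETELICLNDVTSTSEVAETPEDVLEGIEL"]
seq2 = ["EETELICLNDVTSPLRAVEHSAVLLKDKVE", "EETELICLNDVTSPSRAMEHSTVFIEEKEE"]
seq4 = ["EETELICLNDVTSPLRAVEHSAVLLKDKVE", "EETELICLNDVTSPSRAMEHSTVFIEEKEE"]

print(combined_score(seq2, seq4))  # identical segment lists
print(combined_score(seq1, seq2))  # shared segments, different repeat counts
```

With this sketch, seq2 and seq4 get the maximum score because their segments and repeat counts are identical, while seq1 and seq2 are penalised mainly for having 5 versus 2 repeats. Whether a best-match pairing and this weighting are biologically reasonable is exactly what I am unsure about.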