I'm fairly new to coding, and I finally have a problem that I really want to automate.
I'm interested in a roughly 50-amino-acid motif that occurs multiple times in the same protein. A multiple sequence alignment of the 4 occurrences in my favorite protein reveals clear sequence conservation. The motif has the form:
(6-residue conserved motif)-(variable loop of ~6-20 AAs)-(12-residue conserved motif)-(short variable 5-7 AA loop)-(10-residue conserved motif)
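To make that layout concrete, here is a minimal Python sketch that encodes the block/loop structure as a regular expression. Only the block lengths (6, 12, 10) and loop ranges (6-20, 5-7) come from the description above. The residues inside `BLOCK1`, `BLOCK2` and `BLOCK3` are invented placeholders, and putting the hydrophobic-[S/T]-hydrophobic-proline pattern in the first block is an assumption made purely for illustration. The lookahead lets `finditer` report every start position instead of stopping at the single best match.

```python
import re

# Assumption: a simple hydrophobic residue class, chosen for illustration.
HYDROPHOBIC = "[AILMFVW]"

# Hypothetical conserved blocks. Only their LENGTHS come from the motif
# description; the residues themselves are made-up placeholders.
BLOCK1 = HYDROPHOBIC + "[ST]" + HYDROPHOBIC + "P.."  # 6 residues
BLOCK2 = "G..[DE]...[KR]...W"                        # 12 residues
BLOCK3 = "C....[FY]...H"                             # 10 residues

# block1 - loop of 6-20 - block2 - loop of 5-7 - block3.
# Wrapping the pattern in a lookahead with a capture group makes every
# start position a candidate, so repeated instances are all reported.
MOTIF = re.compile(
    "(?=(" + BLOCK1 + ".{6,20}?" + BLOCK2 + ".{5,7}?" + BLOCK3 + "))"
)


def find_motifs(sequence):
    """Return (start, matched_text) for every position where the motif matches."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence)]
```

A rigid pattern like this only finds what is explicitly written into it, so it is better seen as a way to pin down the spacing rules than as a replacement for a profile built from the alignment.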
Based on the conserved features I can see in the alignment, I can identify this motif by eye in proteins in the NCBI database. My main goal is to be able to supply an input sequence and have a script identify all instances of the motif. My first question is: do I need to build my own hidden Markov model, or are there existing tools that would likely do the trick?
I can do a multiple sequence alignment of my 4 known motifs together with an unknown sequence that is expected to contain the motif. This seems to help the alignment software find the pattern in the unknown sequence, if it exists. However, it only finds the best match in the unknown sequence, so I have to remove that match from the input to find the second-best match, and so on. Doing this by hand takes far too long, but it made me realize that I want to incorporate the information from a multiple alignment of motifs into the way I search for new ones.
This matters because searching with my initial alignment often finds motifs that differ notably from the ones in my favorite protein but still have the same overall pattern (e.g. hydrophobic-serine-hydrophobic-proline becomes hydrophobic-threonine-hydrophobic-proline: the hydrophobic residues differ and serine becomes threonine, but the pattern is the same and sits in the same position in the 50-AA motif). Upon finding new ones, I can align them to the ones from my favorite protein, giving a new alignment that better represents the possible sequence space of this motif. My expectation is that, as I iteratively add to the alignment, it will get better and better at identifying different versions of the motif. What would be the best way to continually update the definition of the motif I'm searching for using the instances I've identified?
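To illustrate the kind of loop I mean, here is a minimal Python sketch of a position-specific scoring matrix (PSSM) that reports every non-overlapping hit above a threshold and is then rebuilt from the seeds plus the new hits. Everything in it is an assumption for illustration: the function names, the uniform background, the pseudocount of 1 and the threshold are all made up. It is also gap-free and fixed-width, so it could only model one conserved block at a time; the variable loops would need a gapped profile such as a profile HMM, which is what tools like HMMER (`hmmbuild`, `hmmsearch`, `jackhmmer`) build and search with.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_blocks, pseudocount=1.0):
    """Per-column log-odds scores from equal-length, gap-free motif instances."""
    width = len(aligned_blocks[0])
    if any(len(block) != width for block in aligned_blocks):
        raise ValueError("all blocks must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_blocks) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(block[col] for block in aligned_blocks)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(sequence, pssm, threshold):
    """Return all non-overlapping windows scoring >= threshold, best first."""
    width = len(pssm)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(aa in pssm[0] for aa in window):  # skip ambiguous codes like X
            score = sum(pssm[i][aa] for i, aa in enumerate(window))
            if score >= threshold:
                scored.append((score, start, window))
    scored.sort(key=lambda hit: (-hit[0], hit[1]))
    accepted = []
    for score, start, window in scored:
        # Keep a hit only if it does not overlap a better one already kept.
        if all(abs(start - kept_start) >= width for _, kept_start, _ in accepted):
            accepted.append((score, start, window))
    return accepted


def refine(seed_blocks, sequences, threshold, max_rounds=5):
    """Rebuild the PSSM from seeds plus new hits until nothing new is found."""
    blocks = list(seed_blocks)
    for _ in range(max_rounds):
        pssm = build_pssm(blocks)
        found = {w for seq in sequences for _, _, w in scan(seq, pssm, threshold)}
        new = sorted(found - set(blocks))
        if not new:
            break
        blocks.extend(new)
    return blocks
```

One thing to watch with any loop like this is drift: each false positive that gets added loosens the profile for the next round, so the threshold (or a manual check of new hits) matters more than the exact scoring scheme.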