My question pertains to features that can be derived solely from sequence. Do you think that docking domains are specific to terminal regions? In other words, I have a set of potential substrates: do I have to consider the entire chain of amino acids? For example, in MEKs I would consider only the N-terminal regions, but once I am working with all sorts of proteins I am not sure.
I'm a big fan of the ELM database, which curates a list of short motifs (expressed as regular expressions) that are known to facilitate various forms of binding. PROSITE entries represent similar kinds of information.
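Since these motifs are regular expressions, scanning a sequence for one takes only a few lines. Here is a minimal sketch in Python. Note that the pattern `ILLUSTRATIVE_MOTIF` and the function `scan_motif` are made up for this example: the pattern is not copied from ELM or PROSITE, so substitute the real expression of whichever motif class you care about.

```python
import re

# Hypothetical pattern for illustration only (NOT an actual ELM entry):
# a basic residue, 2-4 arbitrary residues, then hydrophobic-x-hydrophobic.
ILLUSTRATIVE_MOTIF = re.compile(r"[KR].{2,4}[ILV].[ILV]")


def scan_motif(sequence, pattern=ILLUSTRATIVE_MOTIF):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 0-based and `end` is exclusive, as in Python slicing.
    """
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    for start, end, text in scan_motif("MAAKPTLALQQ"):
        print(start, end, text)
```

Because the scan runs over the whole chain, it makes no assumption about where in the protein a motif sits. `finditer` reports non-overlapping matches only, so overlapping occurrences would need a lookahead-based pattern.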
As to whether they are likely to be in terminal regions: I would not make that assumption. There are certainly examples of N- (or C-) terminal recognition sequences, but I do not think that they generalize to all protein-protein interactions.
A simple ELM scan of the entire human proteome does not reveal a terminal preference (p = 0.35 when testing the distance from the N or C terminus).
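If you want to try a check like this on your own substrates, one way to express "distance from a terminus" is sketched below. This is an assumption about how such a test could be set up, not the exact procedure behind the p-value above; the function names `relative_terminal_distance` and `null_distances` are hypothetical.

```python
import random


def relative_terminal_distance(start, end, length):
    """Fraction of the sequence between a match and its nearest terminus.

    `start`/`end` are 0-based with `end` exclusive; 0.0 means the match
    touches the N or C terminus.
    """
    to_n_term = start
    to_c_term = length - end
    return min(to_n_term, to_c_term) / length


def null_distances(match_len, length, n=1000, seed=0):
    """Distances for matches of the same length placed uniformly at random.

    Serves as a simple null distribution to compare observed matches against.
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(n):
        start = rng.randint(0, length - match_len)  # both bounds inclusive
        distances.append(relative_terminal_distance(start, start + match_len, length))
    return distances
```

You could then compare the observed distances against the null ones with whatever two-sample test you prefer. Normalizing by sequence length keeps short and long proteins comparable.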
Termini often make good docking/interaction targets since they tend to be flexible, solvent-accessible and disordered. However, interaction domains are not specific to termini; there are plenty of examples of flexible loop regions internal to the sequence that form intermolecular interactions.
I'd suggest searching PubMed for a recent review on the state of the art in prediction of protein-protein interaction. Most approaches use a combination of features: structural, physicochemical, sequence conservation.
Do you mean prediction of protein-protein docking, or prediction of protein-protein interaction?
Will already suggested some interesting features that you may consider. IMHO, this is an important problem in the application of machine learning approaches in bioinformatics, and several interesting papers are available with a variety of features based on the concepts of evolution, domain-domain interaction, sequence similarity, homology, etc. If you are new to this particular area, I would recommend starting with some of the review articles (1, 2, 3 and several other review articles via PubMed/IEEE/Google).
Because several algorithms in this domain already use sequence-based features, your challenge will be to use a set of novel features, or to generate hybrid features from the available ones, to get better prediction results. In recent work, I used hybrid features that combine two or more sequence features and noticed that they contributed more to the overall prediction.
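To make "hybrid features" concrete, here is a minimal sketch that concatenates two common sequence descriptors, amino acid composition and dipeptide composition, into one vector. The choice of these two descriptors and the function names are assumptions made for illustration; they are not necessarily the features used in the work mentioned above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Fraction of each of the 20 standard residues (20 values)."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values).

    Pairs containing non-standard letters are ignored in the counts.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n_pairs = len(seq) - 1
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate both descriptors into a single 420-dimensional vector."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    return aa_composition(seq) + dipeptide_composition(seq)
```

For a protein pair, one simple option is to concatenate the two proteins' vectors before handing them to a classifier. Whether a hybrid vector actually helps has to be checked by cross-validation on your own data.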
I would also like to point you to a recent Biostar question, "Sequence based protein interaction information", which discussed some of the available tools for protein-protein interaction.