I'm working on a project to classify stretches of amino acids as phosphosites or non-phosphosites. I'm taking the positive examples (known phosphosites) from PhosphoBase, but I'm not sure how to build a negative training set.
When I randomly sample S/T/Y sites from the PhosphoBase proteins as negatives, my baseline classifier, a set of regular expressions, flags about the same percentage of sites as serine, threonine, or tyrosine phosphosites in the negative set as in the positive set. (The regular expressions are from PROSITE: the general patterns for the serine/threonine/tyrosine kinases.)
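To make the setup concrete, here is a minimal sketch of the random sampling and regex baseline described above. It is not the exact code in use: the function names, the window size, and the data layout (a dict of sequences plus a set of annotated positions) are all assumptions, and `EXAMPLE_MOTIF` is only an illustrative pattern in the PROSITE style, to be replaced with the real patterns.

```python
import random
import re

WINDOW = 7  # residues kept on each side of the candidate site (assumed value)

# Illustrative motif only, written in the style of a PROSITE pattern
# ([ST]-x-[RK]); substitute the actual patterns used for the baseline.
EXAMPLE_MOTIF = re.compile(r"[ST].[RK]")


def candidate_sites(sequence, residues="STY"):
    """Return 0-based positions of all S/T/Y residues in a sequence."""
    return [i for i, aa in enumerate(sequence) if aa in residues]


def window_around(sequence, pos, flank=WINDOW):
    """Extract the site plus flanking residues, padding the ends with '-'."""
    padded = "-" * flank + sequence + "-" * flank
    return padded[pos:pos + 2 * flank + 1]


def sample_negatives(proteins, known_sites, n, seed=0):
    """Randomly sample S/T/Y windows that are not annotated phosphosites.

    proteins:    dict mapping protein id -> sequence
    known_sites: set of (protein id, 0-based position) annotated as positives
    """
    rng = random.Random(seed)
    pool = [(pid, pos)
            for pid, seq in sorted(proteins.items())
            for pos in candidate_sites(seq)
            if (pid, pos) not in known_sites]
    chosen = rng.sample(pool, min(n, len(pool)))
    return [(pid, pos, window_around(proteins[pid], pos)) for pid, pos in chosen]


def motif_hit_rate(sites, proteins, motif=EXAMPLE_MOTIF):
    """Fraction of sites at which the motif matches starting at the site."""
    if not sites:
        return 0.0
    hits = sum(1 for pid, pos in sites if motif.match(proteins[pid], pos))
    return hits / len(sites)
```

Comparing `motif_hit_rate` on the positive sites and on the sampled negatives gives the two percentages mentioned above. Note that this sampling only excludes annotated sites, so the negatives may still contain real phosphosites that have not been annotated yet.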
Any thoughts on how I could generate a better negative dataset?