I have a question that I am researching, but in the meantime I would like to gather feedback on it. Would you care to speculate or hypothesize?
I would like to classify a set of proteins as either belonging to a group or not, using machine learning techniques. Pretty straightforward so far. I have downloaded proteins from UniProt as two sets, for example protein-X vs. not protein-X. As one would expect, the results also contain many fragments (length < 50 AA) among the protein sequences.
Would you be inclined to remove (or not remove) the protein-X fragments (length < 50 AA) from the superset of proteins? Do the protein-X fragments represent the category of proteins being investigated or not?
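To make the question concrete, here is a minimal sketch of how the fragments could be separated from the longer sequences, so that either choice (keeping or removing them) can be tried. It assumes the download is in FASTA format and uses only the 50 AA length cutoff mentioned above. The function names and the example records are hypothetical, made up for illustration, and are not real UniProt entries.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 50  # cutoff from the question: sequences shorter than this count as fragments

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Record], min_length: int = MIN_LENGTH
) -> Tuple[List[Record], List[Record]]:
    """Split records into (long enough, fragments) by sequence length."""
    full: List[Record] = []
    fragments: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            full.append((header, seq))
        else:
            fragments.append((header, seq))
    return full, fragments


# Hypothetical example input: one 80 AA sequence and one 30 AA fragment.
example = [
    ">hypothetical_full protein-X",
    "M" * 40,
    "K" * 40,
    ">hypothetical_fragment protein-X (Fragment)",
    "MKV" * 10,
]

full, fragments = split_by_length(parse_fasta(example))
print(len(full), "sequences kept,", len(fragments), "fragments set aside")
```

For a real download, the lines of an opened FASTA file could be passed to `parse_fasta` in place of the `example` list.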
I would appreciate your insights.