I am currently doing research to develop a deep learning method that predicts whether two proteins have similar functions, given features of the two proteins.
To build this deep learning model, I need a suitable dataset to train it. The requirements for the dataset are:
1. It contains enough protein pairs, each with a good number of pairwise features.
2. Every protein in the dataset has known GO terms (I want to use the GO terms to calculate the semantic similarity of the two proteins and use that as the label to train the model).
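
To make the second requirement concrete, here is a minimal sketch of how I imagine deriving a label from GO terms. It uses a plain Jaccard overlap of the two proteins' GO term sets, which is only a simple stand-in for a real semantic similarity measure. The function name `go_jaccard`, the protein names, and the GO annotations in the example are all hypothetical and are not taken from any dataset.

```python
def go_jaccard(terms_a, terms_b):
    """Jaccard overlap of two GO term sets, a value in [0, 1].

    A simple stand-in for a GO semantic similarity measure; it ignores
    the ontology structure and only compares the annotated terms directly.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        # No annotation on either protein: treat as no evidence of similarity.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical annotations, for illustration only.
annotations = {
    "protA": {"GO:0003677", "GO:0006355"},
    "protB": {"GO:0003677", "GO:0005634"},
}

# One shared term out of three distinct terms.
label = go_jaccard(annotations["protA"], annotations["protB"])
```

A protein without any GO terms cannot get a meaningful label this way, which is why I need every protein in the dataset to be annotated.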
Is there any dataset that meets these requirements?
So far, I have only found this dataset: http://mine5.ics.uci.edu:1026/gain.html
It was generated from Lindahl's dataset and has pairwise features, but no GO term annotations.
Total number of unique proteins: 976. Total number of query-template pairs: 951600.
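
As a quick sanity check on these numbers, the pair count matches every protein being paired with every other protein in both directions (query-template and template-query), excluding self-pairs. That interpretation is my assumption based on the arithmetic, not something I have confirmed from the dataset's documentation.

```python
n_proteins = 976

# Ordered pairs (query, template) with query != template.
ordered_pairs = n_proteins * (n_proteins - 1)

# If direction did not matter, each pair would be counted once.
unordered_pairs = ordered_pairs // 2

print(ordered_pairs, unordered_pairs)
```

So if the pairwise features are symmetric, the number of distinct protein pairs available for training would be half of the reported total.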
Some of the protein names look like these
I don't know what the name format means; it looks like a combination of PDB and SCOP identifiers.
If I use this dataset, how can I find the GO terms of each protein in it?