7.7 years ago by
It depends on your skills and interests, but here is one idea: counting the frequencies of various amino-acid features in UniProt (15 million proteins) makes a really interesting starter project.
In my data-mining project I counted the frequency of 7-mers in a 10% sample of UniProt: https://github.com/alevchuk/nmercount-classifier
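To give a feel for what counting n-mers involves, here is a minimal Python sketch. It is not the code from the repository above, just an illustration under some assumptions: the input is plain FASTA text, n-mers are overlapping windows of length k, and the names `parse_fasta`, `count_kmers` and the toy records are made up for the example.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_kmers(sequences, k=7):
    """Count every overlapping window of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Toy input (made-up records, not real UniProt entries).
toy_fasta = """>sp|P00001|TOY1 hypothetical protein
MKVLAAGMKV
LAAG
>sp|P00002|TOY2 hypothetical protein
MKVLA
""".splitlines()

toy_counts = count_kmers((seq for _, seq in parse_fasta(toy_fasta)), k=3)
print(toy_counts.most_common(3))
```

For the real thing you would stream the UniProt FASTA file through `parse_fasta` instead of the toy string. With k=7 there are 20^7 (about 1.28 billion) possible n-mers over the 20 standard amino acids, so memory is worth keeping an eye on, which is one reason to start with a sample.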
Some of the patterns that show up are really mysterious and intriguing, e.g. the 3 clusters in the 2-mer and 3-mer counts, and the fluid-like way the shape of the data changes as the n in n-mer is increased.
Perhaps you could do something similar but break it down by the 3 domains of life, for example (the species is known for each protein in UniProt).
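A rough sketch of that split in Python, assuming UniProt-style FASTA headers that carry the organism's taxonomy ID in an `OX=` field. The `DOMAIN_BY_TAXID` table here is a hand-filled toy lookup and the records are invented; in practice the taxid-to-domain mapping would have to come from a taxonomy lineage resource, which this sketch does not cover.

```python
import re
from collections import Counter, defaultdict

# UniProt FASTA headers include the NCBI taxonomy ID as "OX=<digits>".
OX_PATTERN = re.compile(r"\bOX=(\d+)")

# Toy lookup table (assumption for illustration only).
DOMAIN_BY_TAXID = {
    9606: "Eukaryota",  # Homo sapiens
    562: "Bacteria",    # Escherichia coli
    2190: "Archaea",    # Methanocaldococcus jannaschii
}


def domain_of(header):
    """Return the domain of life for a FASTA header, or 'unknown'."""
    match = OX_PATTERN.search(header)
    if match is None:
        return "unknown"
    return DOMAIN_BY_TAXID.get(int(match.group(1)), "unknown")


def count_kmers_by_domain(records, k=3):
    """Keep one k-mer Counter per domain of life."""
    counts = defaultdict(Counter)
    for header, seq in records:
        domain = domain_of(header)
        for i in range(len(seq) - k + 1):
            counts[domain][seq[i:i + k]] += 1
    return counts


# Made-up records, not real UniProt entries.
toy_records = [
    ("sp|X00001|TOY1_HUMAN Toy protein OS=Homo sapiens OX=9606 PE=1 SV=1", "MKVLAMKV"),
    ("sp|X00002|TOY2_ECOLX Toy protein OS=Escherichia coli OX=562 PE=1 SV=1", "MKVMKV"),
    ("sp|X00003|TOY3 Toy protein without taxonomy", "AAAA"),
]

domain_counts = count_kmers_by_domain(toy_records, k=3)
for domain, counter in domain_counts.items():
    print(domain, counter.most_common(2))
```

Records whose taxid is missing or not in the table land in an "unknown" bucket rather than being dropped, so you can see how much of the data the mapping fails to cover.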