I have a small set (~100 or so) of genes that I have identified as interesting in a bacterial species, and I'd like to do a simple gene enrichment analysis to see which functional categories may be overrepresented in the set. Tools for the job include Blast2Go, with which you can implement a Fisher's exact test, and GoEast, which uses other (perhaps better?) statistical tests such as the hypergeometric test. (If there are others, I'd love to know about them too.)

However, these tools require a user-defined 'background' or 'reference' set of genes to be uploaded, so my question is: how do you define the background gene set? Is it better to use the whole genome of the organism in question, such that the test becomes: given a genome full of genes with associated functions, which GO categories are overrepresented in the subset of genes I have highlighted? Or is it better to provide a randomly selected reference set of equivalent size, i.e. randomly select ~100 genes to use as a background representation of 'random' functional diversity? Tips and common protocols for this type of analysis would be much appreciated!

Thanks in advance!

PS, I am aware there are a number of questions with similar titles already on Biostars, but I don't think any directly answer this question - apologies if I've missed something.

PPS, perhaps I should say that my genes of interest were NOT identified from differential expression experiments or suchlike, but I do have access to whole-genome data for my organism.

The underlying assumption in most enrichment analyses is that most genes do NOT change. Thus it would not even matter whether you used all genes or only the genes that are not in your selection. I would say that there is no advantage in providing a smaller subset, as it only decreases the power of the test (although that probably depends on the type of test).
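To make the power argument concrete, here is a minimal sketch in Python (standard library only) of a one-sided hypergeometric test, which gives the same p-value as a one-sided Fisher's exact test on the corresponding 2x2 table. All counts below are made up for illustration, and the function name `hypergeom_enrichment_p` is hypothetical; it is not part of Blast2Go or GoEast.

```python
from math import comb


def hypergeom_enrichment_p(background_size, background_with_term,
                           set_size, set_with_term):
    """One-sided enrichment p-value, P(X >= set_with_term).

    X is the number of term-annotated genes seen when set_size genes are
    drawn without replacement from a background of background_size genes,
    of which background_with_term carry the term.
    """
    max_hits = min(background_with_term, set_size)
    total = comb(background_size, set_size)
    tail = sum(
        comb(background_with_term, k)
        * comb(background_size - background_with_term, set_size - k)
        for k in range(set_with_term, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 100 genes of interest, 15 of them carry some GO term.

# Whole genome as background (assumed 4000 genes, 200 with the term, and
# the 100 genes of interest are part of those 4000).
p_genome = hypergeom_enrichment_p(4000, 200, 100, 15)

# Small random background instead: 100 other genes, 5 with the term.
# Fisher's exact test on this 2x2 table pools both groups:
# 200 genes in total, 20 with the term, 100 of them in the gene set.
p_small = hypergeom_enrichment_p(200, 20, 100, 15)

print(f"whole-genome background: p = {p_genome:.3g}")
print(f"100-gene background:     p = {p_small:.3g}")
```

With these invented numbers the term makes up 5% of the background in both cases, yet the whole-genome background gives the smaller p-value. That is the loss of power you would expect from a small reference set. In a real analysis you would also correct for testing many GO terms at once.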

Thanks, Istvan

