I have a set of 800 genes, each with evidence-based involvement in a particular disease. All of them are protein-coding genes. Let's call this set A. For a machine learning project I need a set of genes that are, with high probability, not involved in that disease. Let's call this set B. Research usually focuses on finding genes that are involved in a pathology, not on the opposite, i.e. finding genes that are not involved in it. Still, it should be possible to predict such genes with some strategy for generating set B. Given what is known about the genes in set A (their identity, genomic location, associated GO terms, expression profile in the tissue where the pathology arises, etc.), how could one propose set B using a bioinformatics approach? Can anyone point to a paper or share their own experience with a similar question? Set A is not based on GWAS, so it seems that GWAS may not play a role in this case.
I first thought of using housekeeping genes as set B. It turns out, however, that the intersection of A and B would then contain 10% of A. Can one do better than housekeeping genes when building B? Can the distribution of set A across the genome give any clues for set B?
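To make the overlap problem concrete, here is a minimal Python sketch on toy data. The gene symbols are made-up placeholders, not my real lists, and the function name is only for illustration. It measures what fraction of A falls into a candidate negative set and then removes the shared genes.

```python
def overlap_fraction(disease_genes, candidate_negatives):
    """Fraction of the disease genes that also appear in the candidate negative set."""
    disease = set(disease_genes)
    if not disease:
        return 0.0
    shared = disease & set(candidate_negatives)
    return len(shared) / len(disease)


# Toy placeholder symbols (assumption: stand-ins for real gene identifiers).
set_a = {f"A{i}" for i in range(1, 11)}          # 10 "disease" genes
housekeeping = {"A1", "HK1", "HK2", "HK3"}       # 1 of them is also in set A

print(overlap_fraction(set_a, housekeeping))     # 0.1

# Simplest cleanup: drop every gene that is already known to be in A.
clean_b = housekeeping - set_a
print(sorted(clean_b))                           # ['HK1', 'HK2', 'HK3']
```

Removing the overlap is trivial, of course. What worries me is that a 10% overlap suggests housekeeping genes as a class may not be a reliable source of negatives.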
One option might be to find the convergent (common) GO terms across the elements of A, and then exclude all protein-coding genes that are not in A but share those GO terms with A. In other words, the set difference between all protein-coding genes in the genome and set A gives a candidate list for B, which can then be refined by filtering its elements on particular GO terms. What is your opinion on this suggestion? Can R be used to carry out the GO-term analysis described above?
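The GO-based filter I have in mind can be sketched as follows, in Python purely for illustration and on toy data. The gene names and GO identifiers are placeholders, and the function names are hypothetical. One assumption I added: since a term shared by literally all 800 genes is probably rare, "common" is defined through a `min_fraction` threshold, where 1.0 reproduces the strict "shared by every element of A" version.

```python
from collections import Counter


def common_terms(gene_to_go, set_a, min_fraction=1.0):
    """GO terms annotated to at least min_fraction of the genes in set A."""
    counts = Counter(
        term for gene in set_a for term in gene_to_go.get(gene, set())
    )
    threshold = min_fraction * len(set_a)
    return {term for term, n in counts.items() if n >= threshold}


def candidate_negatives(gene_to_go, set_a, min_fraction=1.0):
    """Genes outside A that carry none of the GO terms common in A."""
    shared = common_terms(gene_to_go, set_a, min_fraction)
    return {
        gene
        for gene, terms in gene_to_go.items()
        if gene not in set_a and not (terms & shared)
    }


# Toy annotation table (placeholder gene names and GO identifiers).
gene_to_go = {
    "A1": {"GO:T1", "GO:T2"},
    "A2": {"GO:T1", "GO:T3"},
    "X1": {"GO:T1"},            # shares the term common to all of A
    "X2": {"GO:T4"},            # shares nothing with A
    "X3": {"GO:T3", "GO:T5"},   # shares a term held by only half of A
}
set_a = {"A1", "A2"}

print(sorted(candidate_negatives(gene_to_go, set_a, min_fraction=1.0)))  # ['X2', 'X3']
print(sorted(candidate_negatives(gene_to_go, set_a, min_fraction=0.5)))  # ['X2']
```

Lowering `min_fraction` makes the filter stricter, so B shrinks but is presumably cleaner. I do not know what a sensible threshold would be, nor whether very broad GO terms should be ignored before counting.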
Many thanks for any comments.