I have a set of 800 genes, each with evidence-based involvement in a particular disease. All of them are protein-coding genes. Let's call this set A. For a machine learning project I need a set of genes that are, with high probability, not involved in that disease. Let's call this set B. Science usually focuses on finding genes that are involved in a pathology rather than the opposite, i.e. finding genes that are not involved in it. Still, one could make predictions based on some strategy for generating set B. Given what is known about the genes in set A (their identity, location in the genome, associated GO terms, expression profile in the tissue where the pathology arises, etc.), how could one propose set B using a bioinformatics approach? Can anyone point to a paper or share their own experience with a similar question? Set A is not based on GWAS, so GWAS may not play a role in this case.
My first thought was to use housekeeping genes as set B. It turns out that the intersection of A and B would then contain 10% of A. Can one do better than housekeeping genes when building B? Can the distribution of set A across the genome give some clues for set B?
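For reference, the overlap mentioned above can be checked with plain set operations. A minimal Python sketch, where the gene symbols are made up and the function name `overlap_report` is only an illustration:

```python
def overlap_report(set_a, candidate_b):
    """Return (shared genes, fraction of A that is shared, B without A's genes)."""
    set_a, candidate_b = set(set_a), set(candidate_b)
    shared = set_a & candidate_b
    # Fraction of the disease set that also appears in the candidate negative set
    fraction_of_a = len(shared) / len(set_a) if set_a else 0.0
    return shared, fraction_of_a, candidate_b - set_a


# Hypothetical gene symbols, for illustration only
disease_genes = {"G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", "G10"}
housekeeping = {"G10", "H1", "H2", "H3"}

shared, frac, cleaned_b = overlap_report(disease_genes, housekeeping)
print(sorted(shared), frac, sorted(cleaned_b))
```

Dropping the shared genes from B removes the direct contradiction, but it does not say anything about whether the remaining housekeeping genes are truly uninvolved.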
One could try to find GO terms shared across the elements of A, and then exclude all protein-coding genes that are not in A but share those GO terms with A. The set difference between all protein-coding genes in the genome and set A gives a candidate list for B, which can be refined further by filtering its elements on particular GO terms. What is your opinion on this suggestion? Can R be used for the GO-term analysis described above?
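To make the idea concrete, here is a minimal Python sketch of the filtering step. Everything in it is an assumption for illustration: the gene names and term labels are placeholders rather than real GO IDs, the function name `propose_negative_set` is hypothetical, and "shared" is approximated by a simple frequency cutoff (`min_fraction`) instead of a proper enrichment test.

```python
from collections import Counter


def propose_negative_set(all_genes, set_a, gene_to_go, min_fraction=0.2):
    """Candidate set B: genes outside A that carry none of A's recurrent GO terms.

    min_fraction is an arbitrary illustrative threshold: a term counts as
    'recurrent' if at least that fraction of the genes in A are annotated with it.
    """
    set_a = set(set_a)
    # Count each term at most once per gene in A
    counts = Counter(t for g in set_a for t in set(gene_to_go.get(g, ())))
    threshold = min_fraction * len(set_a)
    recurrent = {t for t, n in counts.items() if n >= threshold}

    # Start from the set difference: all protein-coding genes minus A
    candidates = set(all_genes) - set_a
    # Keep only candidates that share no recurrent term with A.
    # Note: genes with no annotation at all pass this filter.
    set_b = {g for g in candidates if not (set(gene_to_go.get(g, ())) & recurrent)}
    return set_b, recurrent


# Placeholder annotations, not real GO identifiers
gene_to_go = {
    "A1": {"term_1", "term_2"},
    "A2": {"term_1"},
    "A3": {"term_1", "term_3"},
    "X1": {"term_1", "term_9"},  # shares a recurrent term with A
    "X2": {"term_9"},
    "X3": {"term_3"},            # term_3 is rare in A, so X3 is kept
}
set_b, recurrent = propose_negative_set(
    gene_to_go.keys(), {"A1", "A2", "A3"}, gene_to_go, min_fraction=0.5
)
print(sorted(recurrent), sorted(set_b))
```

The cutoff controls how aggressive the exclusion is: a low value removes many genes from B, while a high value only removes genes sharing the most common terms. Poorly annotated genes would need separate handling, since absence of annotation is not evidence of non-involvement.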
Many thanks for any comment.
Good question. It would be useful to explain exactly how you selected the genes in set A.
Thanks. Set A is evidence-based: each element of A is related to the disease either causally or by strong association. Think of genetic tests or genome sequencing of affected individuals and possibly their parents and siblings. There are databases that curate this evidence and provide the list of genes, each with an evidence score for its involvement in that particular pathology. I did not select A; it was given to me.