I will be training a classifier based on the results of unsupervised clustering of genes. The overall goal is to determine which chromatin or epigenetic features of these genes best indicate that they are regulated by a central regulatory protein.
First, I will cluster genes based on their expression profiles under different treatment conditions from multiple RNA-Seq experiments. We are working on the assumption that genes that cluster together are co-regulated by this central protein. From the clustering results, we then identify which specific cluster is enriched in genes that are already known to be regulated by this central protein. Genes in this cluster (in addition to the validated ones) will be used as the positive examples for training a classifier, and the remaining genes as the negative examples. The input features include methylation and other chromatin features. From the best-performing model, we then extract the features that are most important or have the largest coefficients. Finally, we can validate that these features are important by performing experiments in the lab.
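To make the workflow concrete, here is a minimal sketch of the steps described above, using scikit-learn and SciPy. Everything in it is an assumption for illustration: the data are synthetic, and k-means, a hypergeometric enrichment test, and logistic regression are stand-ins for whichever clustering method, enrichment test, and classifier are actually used. The feature names and the number of clusters are hypothetical.

```python
import numpy as np
from scipy.stats import hypergeom
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 300 genes, 8 RNA-Seq conditions,
# 5 hypothetical chromatin features.
n_genes, n_conditions = 300, 8
feature_names = ["methylation", "H3K4me3", "H3K27ac", "H3K27me3", "accessibility"]

# The first 100 genes share an expression pattern (the "co-regulated" group).
is_regulated = np.zeros(n_genes, dtype=bool)
is_regulated[:100] = True
pattern = np.array([2, 2, -2, -2, 2, 2, -2, -2], dtype=float)
expression = rng.normal(size=(n_genes, n_conditions))
expression[is_regulated] += pattern

# In this toy data only methylation differs between the two groups.
chromatin = rng.normal(size=(n_genes, len(feature_names)))
chromatin[is_regulated, 0] += 2.0

# 20 genes play the role of experimentally validated targets.
known_targets = np.zeros(n_genes, dtype=bool)
known_targets[:20] = True

# Step 1: cluster genes by expression profile (k=3 is an arbitrary choice).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expression)

# Step 2: find the cluster most enriched in known targets.
def enrichment_pvalue(cluster_mask, known_mask):
    """One-sided hypergeometric p-value for over-representation."""
    total = len(known_mask)
    n_known = int(known_mask.sum())
    n_cluster = int(cluster_mask.sum())
    overlap = int((cluster_mask & known_mask).sum())
    # P(X >= overlap) when drawing n_cluster genes from the total.
    return hypergeom.sf(overlap - 1, total, n_known, n_cluster)

pvalues = {int(c): enrichment_pvalue(labels == c, known_targets)
           for c in np.unique(labels)}
best_cluster = min(pvalues, key=pvalues.get)

# Step 3: positives are the enriched cluster plus the validated genes.
y = ((labels == best_cluster) | known_targets).astype(int)

# Step 4: train a classifier on chromatin features only and cross-validate.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc = cross_val_score(model, chromatin, y, cv=5, scoring="roc_auc").mean()

# Step 5: rank features by absolute coefficient on the standardized inputs.
model.fit(chromatin, y)
coefs = model[-1].coef_[0]
ranking = sorted(zip(feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)

print(f"Enriched cluster: {best_cluster}, mean ROC AUC: {auc:.2f}")
for name, coef in ranking:
    print(f"{name:>15s}  {coef:+.2f}")
```

Note that in this setup the labels come from the clustering step, so the classifier is learning to predict cluster membership from chromatin features. The features are standardized before fitting so that the coefficient magnitudes are comparable across features.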
I would appreciate some insights from a machine learning point of view. Thank you very much.