We are working on cancer mutation data and we found that upon TF binding, there has been an enrichment of mutation occurrence happened on these TF binding regions. Since we couldnt a strong reason for this occurrence we wanted to use feature selection methods.
We thought that say;
Y ( mutation no on these regions) ~ DHSites + Histone Marks + Other TF binding events (such as CTCT, EP300 etc) + RNAseq Reads
So in this matrix, we will have a single row for each binding event of our TF and all the variables will be either categorical(such as DHSsite) or numerical ( such RNAseq reads).
I have seen that people have applied random forest algorithms to predict mutation occurrence in specific regions. But our aim is not to predict anything but simple ask " What is the cause of the mutation occurrence". Therefore, I want to separate my data in to two subsets ( train vs test).
Please forgive my ignorance in the terminology and consider me as a frustrated grad student.