Training and validation set selection

4.8 years ago

mel22 ▴ 100

Dear all, I am working on a large population pooled from many case-control studies. For the genetic analysis, I would like to run a first analysis on one part of the population and then validate the results on the second part. How can I obtain two similar groups from the initial population?

Thank you for your help!

R case control genotyping • 652 views

Without knowing anything about the structure of the data and how it's going to be processed, the only advice that can be given is to use a random split. For machine learning applications, it's common to use 67-80% of the data for training and the rest for testing. Both the training set and the test set have to be representative and the test set has to be large enough for results to be meaningful.
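
As an illustration of the random split described above, here is a minimal sketch in base R. The phenotype table, its column names (`FID`, `IID`, `status`, `study`), the 70/30 ratio, and the output file names are all hypothetical stand-ins, not the poster's actual data. It stratifies the draw by case/control status and source study so that both parts stay representative, which is one reasonable choice among several.

```r
set.seed(2024)  # fix the seed so the split is reproducible

# Hypothetical phenotype table: FID/IID as in a PLINK .fam file,
# plus case/control status and the study each subject came from.
n <- 1000
pheno <- data.frame(
  FID = sprintf("F%04d", seq_len(n)),
  IID = sprintf("I%04d", seq_len(n)),
  status = sample(c("case", "control"), n, replace = TRUE),
  study = sample(c("studyA", "studyB", "studyC"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# One stratum per status/study combination.
strata <- interaction(pheno$status, pheno$study, drop = TRUE)

# Within each stratum, draw about 70% of the rows for the discovery set.
discovery_rows <- unlist(
  lapply(split(seq_len(n), strata), function(rows) {
    rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
  }),
  use.names = FALSE
)

discovery <- pheno[discovery_rows, ]
validation <- pheno[-discovery_rows, ]

# Check that the case/control proportions are similar in both sets.
print(prop.table(table(discovery$status)))
print(prop.table(table(validation$status)))

# Write FID/IID lists (here to a temporary directory) in the two-column
# format that PLINK's --keep option reads.
out_dir <- tempdir()
write.table(discovery[, c("FID", "IID")], file.path(out_dir, "discovery.keep"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(validation[, c("FID", "IID")], file.path(out_dir, "validation.keep"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Each `.keep` file can then be passed to PLINK with `--keep` to restrict an analysis to that subset. Whether to stratify on study, or on additional covariates such as exposure level, depends on the data and is left open here.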

4.8 years ago by Jean-Karim Heriche 27k

Thank you Jean-Karim. It's environmental exposure data and genotyping data (DNA chip), and I would like to characterize interactions between exposure and some SNPs, so I am trying to validate my results in a second part of the population. I am using PLINK and R. How can I split my data in R in the best way (accepted methodology)?

Thank you

4.8 years ago by mel22 ▴ 100

If you're going to use R to apply supervised machine learning algorithms, I would suggest looking into the caret package. It has a createDataPartition() function for splitting data.
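
A short sketch of how `createDataPartition()` might be used for this, assuming the caret package is installed. The toy data frame and the 70% proportion are hypothetical. When the outcome passed in is a factor, the function samples within each class, so the case/control ratio is roughly preserved in both parts.

```r
set.seed(42)  # reproducible partition

# Hypothetical toy data standing in for the real phenotype table.
pheno <- data.frame(
  id = sprintf("ind%03d", 1:200),
  status = factor(rep(c("case", "control"), times = c(80, 120)))
)

# Only run the caret part if the package is available.
if (requireNamespace("caret", quietly = TRUE)) {
  # Row indices of the discovery set, about 70% of each status class.
  idx <- caret::createDataPartition(pheno$status, p = 0.7, list = FALSE)

  discovery <- pheno[idx, ]
  validation <- pheno[-idx, ]

  # Compare class proportions between the two sets.
  print(prop.table(table(discovery$status)))
  print(prop.table(table(validation$status)))
}
```

The resulting subject IDs can be exported and used to subset the genotype data outside R, for example in PLINK.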

4.8 years ago by Jean-Karim Heriche 27k
