Training and validation set selection
4.8 years ago
mel22 ▴ 100

Dear all, I am working on a large population pooled from many case-control studies. For the genetic analysis, I would like to run a first analysis on one part of the population and then validate the results on the second part. How can I obtain two similar groups from the initial population?

Thank you for your help!

R case control genotyping • 652 views

Without knowing anything about the structure of the data and how it's going to be processed, the only advice that can be given is to use a random split. For machine learning applications, it's common to use 67-80% of the data for training and the rest for testing. Both the training set and the test set have to be representative and the test set has to be large enough for results to be meaningful.
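
As a rough illustration of such a random split, here is a minimal base R sketch. It assumes a phenotype table with one row per individual and a case/control column; the data frame `pheno`, its columns `id` and `status`, and the 70% training fraction are all made up for the example. Sampling is done separately within cases and controls so that both subsets keep the original case/control ratio, which is one simple way to keep the two groups similar.

```r
# Hypothetical toy data: one row per individual (names are illustrative only).
set.seed(42)  # fix the seed so the split is reproducible
n <- 1000
pheno <- data.frame(
  id     = sprintf("IND%04d", seq_len(n)),
  status = rep(c("case", "control"), times = c(400, 600)),
  stringsAsFactors = FALSE
)

train_fraction <- 0.7  # assumed value, within the 67-80% range mentioned above

# Sample row indices separately within cases and within controls so that
# both subsets keep the original case/control ratio.
rows_by_status <- split(seq_len(nrow(pheno)), pheno$status)
train_idx <- unlist(lapply(rows_by_status, function(rows) {
  n_train <- round(train_fraction * length(rows))
  rows[sample.int(length(rows), size = n_train)]
}), use.names = FALSE)

train_set <- pheno[train_idx, ]
test_set  <- pheno[-train_idx, ]

# Compare case/control proportions in the two subsets.
print(prop.table(table(train_set$status)))
print(prop.table(table(test_set$status)))
```

If the pooled studies differ a lot from each other, the same idea could be extended by splitting within each study as well, but that depends on the structure of the data.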


Thank you Jean-Karim. It's environmental exposure data and genotyping data (DNA chip), and I would like to characterize interactions between exposure and some SNPs. So I am trying to validate my results in a second part of the population. I am using plink and R; how can I split my data in R in the best way (accepted methodology)?

Thank you


If you're going to use R to apply supervised machine learning algorithms, I would suggest looking into the caret package. It has a createDataPartition() function for splitting data.
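
A minimal sketch of how createDataPartition() might be used here, assuming caret is installed. The phenotype table `pheno`, its `FID`/`IID`/`status` columns, the 50/50 split and the output file name are all hypothetical. When the outcome is a factor, the function samples within each level, so both halves should end up with roughly the same case/control ratio.

```r
library(caret)  # assumes the caret package is installed

set.seed(123)  # fix the seed so the split is reproducible

# Hypothetical phenotype table; column names are illustrative only.
pheno <- data.frame(
  FID    = sprintf("FAM%03d", 1:200),
  IID    = sprintf("IND%03d", 1:200),
  status = factor(rep(c("case", "control"), times = c(80, 120)))
)

# p = 0.5 puts about half of each status level in the first subset.
# list = FALSE returns a one-column matrix of row indices.
idx <- createDataPartition(pheno$status, p = 0.5, list = FALSE)

discovery  <- pheno[idx, ]
validation <- pheno[-idx, ]

# Check that both halves have similar case/control counts.
print(table(discovery$status))
print(table(validation$status))

# Optionally write the discovery IDs as a two-column FID/IID file,
# the format that plink's --keep option reads. The path is hypothetical.
out_file <- file.path(tempdir(), "discovery_ids.txt")
write.table(discovery[, c("FID", "IID")], file = out_file,
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The split is done once on the individuals, and the same ID lists are then reused for both the exposure data and the genotype data so the two subsets stay consistent.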
