Question: HT-Seq count data
14 days ago by Rob30

Hi friends. How should I randomly split my data of 500 patients with RNA-seq data into a 2/3 training set and a 1/3 validation set? I tried random selection in Excel, but the result puts the same patients in my sets more than once. How can I select randomly without getting duplicate patients?

rna-seq • 92 views

Excel is not a good tool for advanced statistics. Please use ML libraries in R or Python to split data into training and test sets. A quick Google search will turn up existing functions in scikit-learn and R that can split your data without much manual work.


If it is only about the splitting, in R you can generate random numbers directly. Here are 167 random numbers between 1 and 500 without duplicates:

``````
> sample(seq(1,500), round(500*(1/3)), replace = FALSE)
[1] 317 335  60 479 136  16  12 366 303 325 245  78 478 307 127 425 500 469 360 446 130 257 463 419  35 198  99
[28] 170 113 102 364 165 302 294 215 481 367 129 449  90  73 251 296 137 347 409 394 187  10  39 106 428 281 447
[55] 451 298 101 125 395 224 291 402 228 464 167 162 240 359  32  43 435 169 321 339  66 380 260  48 311 377 285
[82] 135 470 404 107 178 158 429 152 221 495  79 386 286  36 255 183  71 383 494  21 230 319 476 490 145 493 387
[109] 314  41 416  63 100 310 141 406 334 121  85   3 272 282  87 427 287  40  94 212 206  53 412 258 229 144 370
[136] 358 203 234  30 168 332 309 156 241  15 437 163  64 474 242 181 398  17 442 210 346 443 320 188 403 108  31
[163]  56  27  11 460 329
``````
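To go from these indices to the two actual sets, a minimal base-R sketch could look like the following. The `patients` vector is a hypothetical stand-in for the real sample identifiers, and the seed value is arbitrary; it is only there so the split can be reproduced.

```r
# Hypothetical stand-in for the real data: 500 patient identifiers.
patients <- paste0("patient_", seq_len(500))

# Fix the seed (arbitrary value) so the same split can be regenerated later.
set.seed(42)

# Draw one third of the indices without replacement for the validation set.
n_valid   <- round(length(patients) / 3)
valid_idx <- sample(seq_along(patients), n_valid, replace = FALSE)

validation <- patients[valid_idx]
training   <- patients[-valid_idx]  # every patient not drawn goes to training

# Sanity checks: the sizes add up and no patient appears in both sets.
length(validation)                       # 167
length(training)                         # 333
length(intersect(training, validation))  # 0
```

If the counts are stored as a genes-by-patients matrix (an assumption about the layout; call it `counts`), the same indices would select columns: `counts[, valid_idx]` for validation and `counts[, -valid_idx]` for training.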

As _r_am suggests, please get familiar with proper programming languages. I guarantee you that you do not want to load the expression profile of 500 patients into Excel.

Thanks, ATpoin. This worked great.