Random Subset of Individuals from .BED File (.ped not available)
6.0 years ago
angus.gane • 0

I am trying to split a GWAS cohort into two random samples. I have the .bed, .fam, and .bim files. I know plink has commands for filtering subsets of individuals (--filter), but this seems to require the .map file. It is possible to filter binary files in plink, but it doesn't seem to allow this on the first two 'columns', which contain the individual data I need to filter on.

My very computationally intensive solution has been to recode the .bed file to .ped and .map files for each chromosome (800GB+), randomly select a cohort of individuals with shuf, grep these out of the .ped file, and then recode the result as .bed files.

I was wondering if anyone had a better way of doing this?

Thanks, Angus

plink GWAS • 3.1k views

Are you doing this for some 'machine learning' or bootstrapping method, i.e., breaking the dataset up into training and testing sets?

Just do the following:

  1. obtain a sample ID listing
  2. 'randomly' select sample IDs from the listing (using any programming language)
  3. use --keep or --remove on your BED files to keep or remove samples accordingly
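
The three steps above can be sketched in shell roughly as follows. This is only an illustration, not a tested pipeline: the function name split_fam, the file names, and the subset size are hypothetical, and it assumes GNU shuf is available. It relies on the first two columns of the .fam file being the family ID and individual ID, which is the format --keep expects.

```shell
# Sketch only: split the samples in a .fam file into two random,
# non-overlapping keep lists (FID and IID columns only).
# split_fam, the file names and the sizes are hypothetical.
split_fam() {
    fam=$1        # input .fam file
    n_first=$2    # number of samples for the first subset
    out1=$3       # keep list for the first subset
    out2=$4       # keep list for the second subset

    tmp=$(mktemp) || return 1
    # Steps 1 and 2: list the sample IDs (first two .fam columns)
    # and shuffle them.
    awk '{print $1, $2}' "$fam" | shuf > "$tmp"
    # The first n_first shuffled samples go to one list, the rest to the other.
    head -n "$n_first" "$tmp" > "$out1"
    tail -n +"$((n_first + 1))" "$tmp" > "$out2"
    rm -f "$tmp"
}

# Step 3, example usage (assumes plink 1.9 is on the PATH as "plink"):
#   split_fam file.fam 5000 set1.txt set2.txt
#   plink --bfile file --keep set1.txt --make-bed --out subset1
#   plink --bfile file --keep set2.txt --make-bed --out subset2
```

Because only the small .fam file is shuffled, nothing needs to be recoded to .ped; plink does the subsetting directly on the binary files.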
2.3 years ago
angus.gane • 0

Thank you all. Looking back on this a few years later there are a few possible approaches.

In the end I used sort -R on the fam file, extracted a testing and a training set with head and tail and then used:

plink1.9.exe --bfile file --keep set1.fam --make-bed --out subset1
plink1.9.exe --bfile file --keep set2.fam --make-bed --out subset2

In addition, of course, I ran a few checks to ensure everything went OK!
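
The kind of check meant here might look like the sketch below. It is an assumption about what is worth verifying, not the exact checks that were run: the hypothetical function check_split confirms that no sample appears in both keep files and that the two files together contain as many samples as the original .fam file. It assumes each sample is listed only once within a keep file.

```shell
# Sketch only: sanity-check a two-way split of a .fam file.
# check_split and its arguments are hypothetical names.
check_split() {
    fam=$1     # original .fam file
    set1=$2    # first keep file (FID and IID in columns 1 and 2)
    set2=$3    # second keep file

    # Count samples (FID + IID) that occur in both keep files.
    overlap=$(awk '{print $1, $2}' "$set1" "$set2" | sort | uniq -d | wc -l)
    # Compare the total number of samples before and after splitting.
    n_all=$(wc -l < "$fam")
    n_sets=$(cat "$set1" "$set2" | wc -l)

    if [ "$overlap" -eq 0 ] && [ "$n_all" -eq "$n_sets" ]; then
        echo "OK"
    else
        echo "overlap=$overlap total=$n_all split=$n_sets"
        return 1
    fi
}

# Example usage:
#   check_split file.fam set1.fam set2.fam
```

The same function can be pointed at subset1.fam and subset2.fam after plink has run, to confirm the output files contain the expected samples.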

2.3 years ago

800GB of data per chromosome is a lot. If shuf does not scale to the size of data you are working with and you get out-of-memory errors, then the sample application might be of use. It samples like shuf, but uses a simple trick to reduce memory usage to 8 bytes per line.
