Random Subset of Individuals from .BED File (.ped not available)
4.2 years ago
angus.gane • 0

I am trying to split a GWAS cohort into two random samples. I have the .bed, .fam, and .bim files. I know plink has commands for filtering out subsets of individuals (--filter), but this seems to require the .map file. It is possible to filter binary files in plink, but it doesn't seem to allow this on the first two 'columns', which contain the individual data I need to filter on.

My very computationally intensive solution has been to recode the .bed file to .ped and .map files for each chromosome (800GB+), randomly select a cohort of individuals with shuf, and then grep these out of the .ped file before recoding as .bed files.

I was wondering if anyone had a better way of doing this?

Thanks, Angus


Are you doing this for some 'machine learning' or bootstrapping method, i.e., breaking the dataset up into training and testing sets?

Just do the following:

1. obtain a sample ID listing
2. 'randomly' select sample IDs from the listing (using any programming language)
3. use --keep or --remove on your BED files to keep or remove samples accordingly
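
The three steps above could be sketched in Python roughly as follows. This is only an illustration: the function name `split_fam`, the output file names, the 50/50 default, and the fixed seed are all assumptions, not anything from the thread. It relies on the first two columns of a .fam file being the family ID and the within-family ID, which is the pair that --keep and --remove match on.

```python
import random
from pathlib import Path


def split_fam(fam_path, out_a, out_b, fraction=0.5, seed=42):
    """Randomly split the samples of a .fam file into two ID lists.

    Hypothetical helper: writes two tab-separated FID/IID files that
    can be passed to plink's --keep (or --remove).
    """
    # The first two whitespace-separated columns of a .fam file are FID and IID.
    ids = [
        tuple(line.split()[:2])
        for line in Path(fam_path).read_text().splitlines()
        if line.strip()
    ]
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split reproducible
    rng.shuffle(ids)
    cut = int(len(ids) * fraction)
    first, second = ids[:cut], ids[cut:]
    for path, subset in ((out_a, first), (out_b, second)):
        Path(path).write_text("".join(f"{fid}\t{iid}\n" for fid, iid in subset))
    return first, second
```

Calling `split_fam("file.fam", "keep_a.txt", "keep_b.txt")` would then give two lists to use with `--bfile file --keep keep_a.txt --make-bed` and the same for `keep_b.txt`, so the .bed file never needs to be recoded to .ped.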
5 months ago
angus.gane • 0

Thank you all. Looking back on this a few years later there are a few possible approaches.

In the end I used sort -R on the fam file, extracted a testing and a training set with head and tail and then used:

plink1.9.exe --bfile file --keep set1.fam --make-bed --out subset1
plink1.9.exe --bfile file --keep set2.fam --make-bed --out subset2


In addition, of course, a few checks to ensure everything went OK!
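
I did not record the exact checks, but one reasonable sanity check is that the two output .fam files contain no duplicated or shared samples and together cover the original cohort. A minimal Python sketch, where `read_ids` and `check_split` are hypothetical helper names and the file names simply follow the --out prefixes above:

```python
def read_ids(fam_path):
    """Return the list of (FID, IID) pairs from a .fam file."""
    with open(fam_path) as fh:
        return [tuple(line.split()[:2]) for line in fh if line.strip()]


def check_split(full_fam, fam_a, fam_b):
    """Raise ValueError unless fam_a and fam_b cleanly partition full_fam."""
    full, a, b = (read_ids(p) for p in (full_fam, fam_a, fam_b))
    if len(set(a)) != len(a) or len(set(b)) != len(b):
        raise ValueError("duplicate samples within a subset")
    if not set(a).isdisjoint(b):
        raise ValueError("a sample appears in both subsets")
    if set(a) | set(b) != set(full):
        raise ValueError("subsets do not cover the original cohort")
    return len(a), len(b)  # subset sizes, for a quick eyeball check
```

For example, `check_split("file.fam", "subset1.fam", "subset2.fam")` returns the two subset sizes if everything is consistent.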

5 months ago

800GB of data per chromosome is a lot. If shuf does not scale to the size of data you are working with and you get out-of-memory errors, then the sample application might be of use. It samples like shuf, but uses a simple trick to reduce memory usage to 8 bytes per line.
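
To give a feel for how a sampler can get by on about 8 bytes per line, here is a rough Python sketch. It assumes the trick is to keep only each line's starting byte offset in memory rather than the line itself; that is my reading of the idea, not a description of the tool's actual source, and `sample_lines` is a hypothetical name.

```python
import random
from array import array


def sample_lines(path, k, seed=None):
    """Return k lines sampled without replacement from a text file.

    Only one 8-byte offset per line is held in memory; the line
    contents are read back from disk for the chosen lines only.
    """
    offsets = array("Q")  # unsigned 64-bit integers, 8 bytes each
    with open(path, "rb") as fh:
        pos = 0
        for line in fh:
            offsets.append(pos)  # byte position where this line starts
            pos += len(line)
    rng = random.Random(seed)
    chosen = rng.sample(range(len(offsets)), k)  # pick k distinct line indices
    lines = []
    with open(path, "rb") as fh:
        for idx in chosen:
            fh.seek(offsets[idx])
            lines.append(fh.readline().decode().rstrip("\n"))
    return lines
```

The file is scanned once to collect offsets and then revisited only at the sampled positions, so memory grows with the number of lines rather than with their length, which matters when each .ped line holds a whole individual's genotypes.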