Converting PLINK binary files into python dataframe
19 months ago
bbehrooz • 0

I'm working with a genetic dataset (roughly 23,000 samples and 300,000 SNPs as features). My files are in PLINK binary format (.bed, .bim, .fam). Their sizes are listed below:

.bed file: 1.6G
.bim file: 9.3M
.fam file: 737K

My aim is to convert them into (pandas) dataframes and then start my predictive analysis in Python (it's a machine learning project).

I was advised to combine all 3 binary files into one VCF (variant call format) file. Converting with PLINK produced a 26G VCF file. There are Python packages and code for converting VCF files into pandas dataframes, but the memory on my remote system is limited (15 Gi). Due to the nature of the dataset, I can only work with university computers.

My question is, considering all my limitations, how do I convert my dataset into a dataframe that can be used in machine learning? Let me know if you need more details.

python frame VCF data PLINK genetic • 1.2k views

Use the chunksize option when you load your data into pandas.
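A minimal sketch of what chunked loading could look like, assuming the genotypes have first been exported to a tab-separated text table with one row per sample and one column per SNP (values 0/1/2). The tiny in-memory table, the `IID` column and the `rs1`..`rs3` names are made up for illustration; a real file path would replace the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large tab-separated genotype table
# (one row per sample, one column per SNP, values 0/1/2).
raw = io.StringIO(
    "IID\trs1\trs2\trs3\n"
    "s1\t0\t1\t2\n"
    "s2\t1\t1\t0\n"
    "s3\t2\t0\t0\n"
    "s4\t0\t0\t1\n"
    "s5\t1\t2\t2\n"
)

chunks = []
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only `chunksize` rows are parsed at any one time.
for chunk in pd.read_csv(raw, sep="\t", index_col="IID", chunksize=2):
    # Downcast to int8: each genotype count fits in a single byte.
    chunks.append(chunk.astype("int8"))

# Combine the downcast chunks into one frame.
genotypes = pd.concat(chunks)
print(genotypes.shape)
```

The downcast matters more than the chunking itself. As a rough estimate, 23,000 samples x 300,000 SNPs is 6.9 billion values: about 6.9 GB as int8, but about 55 GB with the default int64. Note that `pd.concat` copies the data, so peak memory is briefly higher than the final frame; processing each chunk and keeping only what you need avoids that.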

19 months ago

Pandas is notoriously slow at reading in large datasets.

I would recommend an alternative like polars or datatable.

Find more information here:

https://towardsdatascience.com/getting-started-with-the-polars-dataframe-library-6f9e1c014c5c

Or pick something else from this list:

http://theautomatic.net/2021/10/09/faster-alternatives-to-pandas/


Thank you so much for your input.
