How to convert PLINK outputs into Parquet files?
telroyjatter ▴ 230 • 3.4 years ago

Hello,

I'm working with PLINK files and trying to build a matrix where rows are samples, columns are variants, and values are minor-allele counts (0, 1, or 2). My dataset has ~2,000 samples and ~6M SNPs. I ran --recode AD on my .bed file to generate a .raw file and tried to import it with the Python package pandas, but I couldn't even read the first row because it has ~6M columns and I ran out of memory. I then used --recode A-transpose to generate a .traw file; the first 10 rows import very easily (each row now has only ~2,000 sample columns instead of 6M), but importing the full file is proving tedious. I tried reading it in chunks with the chunksize parameter (see the sketch below), but that is taking incredibly long to run.
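
For reference, this is roughly the chunked approach I'm using now (the file name is a placeholder, and 100,000 rows per chunk is just the value I happened to pick):

```python
import pandas as pd

# Read the tab-separated .traw in blocks of rows (one row per SNP,
# since the file is transposed) and concatenate everything at the end.
chunks = []
for chunk in pd.read_csv("mydata.traw", sep="\t", chunksize=100_000):
    chunks.append(chunk)

genotypes = pd.concat(chunks)  # this is where it slows to a crawl
```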

I normally use the Parquet file format when I work with large datasets because it supports column-wise reads, loads quickly, takes up less disk space, and puts less load on RAM. Is there a way to convert the .raw or .traw files into Parquet format without having to load them into RAM?
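
The closest I've come is streaming the .traw through pyarrow's CSV reader and writing Parquet batch by batch, along these lines (file names are placeholders, and I'm not sure this is the intended way to do it):

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

src, dst = "mydata.traw", "mydata.parquet"

# open_csv streams the file as record batches, so the full table never
# has to sit in memory at once; the .traw file is tab-separated.
reader = pacsv.open_csv(src, parse_options=pacsv.ParseOptions(delimiter="\t"))

with pq.ParquetWriter(dst, reader.schema) as writer:
    for batch in reader:
        writer.write_table(pa.Table.from_batches([batch]))
```

Is something along these lines reasonable, or is there a more direct route from the PLINK .bed/.raw/.traw files to Parquet?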

plink parquet python R
