How to convert PLINK outputs into Parquet files?
telroyjatter ▴ 230 • 3.4 years ago

Hello,

I'm working with PLINK files and trying to build a matrix where rows are samples, columns are variants, and values are minor-allele counts (0, 1, or 2). My dataset has ~2,000 samples and ~6M SNPs. I ran --recode AD on my .bed file to generate a .raw file and tried to import it with the Python package pandas, but I couldn't even read the first row because it has ~6M columns and I ran out of memory. I then used --recode A-transpose to generate a .traw file; the first 10 rows import very easily (each row now has only ~2,000 sample columns instead of 6M), but importing the full file is proving tedious. I tried reading it in chunks with the chunksize parameter (see the sketch below), but that is taking incredibly long to run.
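
For reference, this is roughly the chunked approach I'm using now (the file name is a placeholder, and 100,000 rows per chunk is just the value I happened to pick):

```python
import pandas as pd

# Read the tab-separated .traw in blocks of rows (one row per SNP,
# since the file is transposed) and concatenate everything at the end.
chunks = []
for chunk in pd.read_csv("mydata.traw", sep="\t", chunksize=100_000):
    chunks.append(chunk)

genotypes = pd.concat(chunks)  # this is where it slows to a crawl
```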

I normally use the Parquet file format when I work with large datasets because it supports column-wise reads, loads quickly, takes up less disk space, and puts less load on RAM. Is there a way to convert the .raw or .traw files into Parquet format without having to load them into RAM?
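
The closest I've come is streaming the .traw through pyarrow's CSV reader and writing Parquet batch by batch, along these lines (file names are placeholders, and I'm not sure this is the intended way to do it):

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

src, dst = "mydata.traw", "mydata.parquet"

# open_csv streams the file as record batches, so the full table never
# has to sit in memory at once; the .traw file is tab-separated.
reader = pacsv.open_csv(src, parse_options=pacsv.ParseOptions(delimiter="\t"))

with pq.ParquetWriter(dst, reader.schema) as writer:
    for batch in reader:
        writer.write_table(pa.Table.from_batches([batch]))
```

Is something along these lines reasonable, or is there a more direct route from the PLINK .bed/.raw/.traw files to Parquet?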

plink parquet python R
