How to salvage incomplete PLINK .bed files?
2.2 years ago
bspitzer ▴ 10

I converted a number of large SeqArray (.gds) files into PLINK binary format, but when the .bed files were written to storage, it appears that many of them became truncated. I would like to salvage as much data as possible from these truncated files.

According to the PLINK documentation, a .bed file is a 3-byte header followed by one fixed-size block per variant, so the number of bytes in a truncated file should reveal how many complete variants it contains. Snipping off any partial block at the end of the .bed file and cutting the .bim file down to the same number of variants should yield a set of PLINK binary files with the correct information for at least those variants.
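The arithmetic can be sketched as follows. This is a minimal Python illustration, assuming the default variant-major layout (a 3-byte header, then one block of ceil(n_samples / 4) bytes per variant); the function name `count_complete_variants` is made up for this example.

```python
HEADER_BYTES = 3  # two magic bytes plus one mode byte


def count_complete_variants(bed_size, n_samples):
    """Return (complete_variants, leftover_bytes) for a variant-major .bed file.

    bed_size is the file size in bytes; n_samples is the number of .fam lines.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # 2 bits per genotype, so 4 genotypes per byte, rounded up per variant
    block_bytes = (n_samples + 3) // 4
    data_bytes = max(bed_size - HEADER_BYTES, 0)
    return divmod(data_bytes, block_bytes)


# 10 samples -> 3 bytes per variant; a 20-byte file has 17 data bytes
print(count_complete_variants(20, 10))  # (5, 2): 5 complete variants, 2 stray bytes
```

The second value is the number of bytes that would have to be cut from the end of the file to remove the partial block.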

Has anyone out there tried to salvage an incomplete .bed file like this? Does anyone have the code to do this? It would save a lot of effort if I could salvage data from these files, but I'm not confident in my ability to do it correctly on my own.

PLINK bash • 1.1k views
2.2 years ago
bspitzer ▴ 10

... It turned out to be easy to code a solution. I did it in R, but it would probably be more efficient as a bash script.

# get the number of samples (one line per sample in the .fam file)
ns <- as.numeric(system("wc -l < incomplete.fam", intern = TRUE))
#   from this, calculate block size (2 bits per genotype, 4 genotypes per byte)
block_bytes <- ceiling(ns / 4)
#   get the total number of variants listed in the .bim file
all_vars <- as.numeric(system("wc -l < incomplete.bim", intern = TRUE))
#   subtract the three header bytes from the actual size of the .bed file
curr_bytes <- file.size("incomplete.bed")
data_bytes <- curr_bytes - 3
#   get the number of blocks with complete data; also get the remainder
nv <- floor(data_bytes / block_bytes)
rem <- data_bytes - (nv * block_bytes)
if (curr_bytes < (block_bytes * all_vars) + 3) {
  #   drop the partial block from the end of the .bed file (GNU truncate);
  #   format() keeps large numbers out of scientific notation (e.g. "1e+05")
  system(paste0("truncate --size=-", format(rem, scientific = FALSE), " incomplete.bed"))
  #   cut the .bim file to the same number of variants
  system(paste0("head -n ", format(nv, scientific = FALSE), " incomplete.bim > temp.bim"))
  system("mv temp.bim incomplete.bim")
}
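For anyone who would rather not modify the damaged files in place, here is a rough Python sketch of the same idea that writes a new file set instead. It assumes a variant-major PLINK 1 .bed file and checks for the corresponding 3-byte header (0x6c 0x1b 0x01); the function name `salvage` and the file prefixes are hypothetical, chosen only for this example.

```python
import itertools
import os
import shutil

MAGIC = b"\x6c\x1b\x01"  # PLINK 1 .bed header, variant-major mode


def salvage(prefix, out_prefix):
    """Copy the complete variants of prefix.{bed,bim,fam} to out_prefix.*.

    Returns the number of variants kept. The input files are not modified.
    """
    with open(prefix + ".fam") as fam:
        n_samples = sum(1 for _ in fam)
    if n_samples == 0:
        raise ValueError("empty .fam file")
    with open(prefix + ".bim") as bim:
        n_bim = sum(1 for _ in bim)

    block_bytes = (n_samples + 3) // 4
    data_bytes = max(os.path.getsize(prefix + ".bed") - len(MAGIC), 0)
    # never keep more variants than the .bim file describes
    n_keep = min(data_bytes // block_bytes, n_bim)

    with open(prefix + ".bed", "rb") as src, open(out_prefix + ".bed", "wb") as dst:
        header = src.read(len(MAGIC))
        if header != MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        dst.write(header)
        for _ in range(n_keep):
            dst.write(src.read(block_bytes))  # one complete variant block

    with open(prefix + ".bim") as src, open(out_prefix + ".bim", "w") as dst:
        dst.writelines(itertools.islice(src, n_keep))

    shutil.copyfile(prefix + ".fam", out_prefix + ".fam")
    return n_keep
```

A call such as `salvage("incomplete", "salvaged")` would then leave the originals untouched, which makes it easy to rerun if the result looks wrong.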

(I apologize for accepting my own solution. It feels as though it's in poor taste to do so.)


That is OK in this case. If @LChart's answer helped you get there then you could accept that answer as well.

2.2 years ago
LChart 5.1k

One of the easiest things to do would be to grab an available bed parser and modify the logic to allow truncated .bed files. For instance:

https://github.com/fastlmm/bed-reader/blob/master/bed_reader/_open_bed.py#L435

will fail because it attempts to read past the EOF, as it iterates through the bim and fam files simultaneously with the bed. You can modify this to accumulate bim/fam entries as it iterates, catch the error thrown by filepointer.read, and return the data read up to that point.
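To illustrate the general idea without touching bed-reader itself, here is a small standalone Python sketch (not bed-reader's API) that walks the .bim and .bed files together and stops at the first incomplete block. It assumes a variant-major PLINK 1 file; the generator name `iter_complete_variants` is hypothetical. Note that a plain Python `read()` returns fewer bytes at EOF rather than raising, so the sketch checks the block length instead of catching an exception.

```python
MAGIC = b"\x6c\x1b\x01"  # PLINK 1 .bed header, variant-major mode

# 2-bit genotype codes mapped to the number of copies of the first .bim allele;
# None marks a missing genotype
DECODE = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def iter_complete_variants(bed_path, bim_path, n_samples):
    """Yield (variant_id, genotypes) for each variant fully present in the .bed."""
    block_bytes = (n_samples + 3) // 4
    with open(bed_path, "rb") as bed, open(bim_path) as bim:
        if bed.read(len(MAGIC)) != MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        for bim_line in bim:
            block = bed.read(block_bytes)
            if len(block) < block_bytes:
                break  # truncated or missing block: stop with what we have
            # samples are packed four per byte, lowest-order bits first
            genotypes = [
                DECODE[(block[i // 4] >> (2 * (i % 4))) & 0b11]
                for i in range(n_samples)
            ]
            yield bim_line.split()[1], genotypes  # column 2 of .bim is the ID
```

Collecting the generator into a list gives every variant that survived the truncation, in file order.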
