Question

SNP filtering for a text file

0

4.7 years ago

evelyn ▴ 230

Hello,

I am trying to filter a text file for missing and redundant markers in R, but the R session gets aborted. I think this is because of the file size, since the same code works on smaller files. The file holds SNP markers for 50 genotypes and is 25 GB in size. I would therefore like to do the same filtering for missing and redundant SNPs on the cluster and then import the filtered file into R for further analysis, but I am not sure how to do this on the cluster.

Additionally, I think there is a problem with saving the text file as an RData file. I appreciate your time and any help. Thank you!

snp R • 1.0k views

ADD COMMENT • link 4.7 years ago by evelyn ▴ 230

0

Have you tried this using fread and data.table? I think it might work, because data.table handles tables with millions of rows far more efficiently than data.frame does. Note that fread still reads the whole table into memory rather than streaming it, but its select, nrows, and skip arguments let you load only some of the columns or rows at a time.
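
A minimal sketch of reading only part of a file with fread, assuming a tab-separated file with a header row. The toy file, the column names (marker, g1, g2, g3), and the row count are made up for illustration and stand in for the real SNP table:

```r
library(data.table)

# Hypothetical toy file standing in for the real SNP table
tmp <- tempfile(fileext = ".txt")
toy <- data.table(marker = paste0("snp", 1:6),
                  g1 = c("A", "A", NA, "T", "A", "G"),
                  g2 = c("A", "A", "C", "T", NA, "G"),
                  g3 = c("C", "C", "C", NA, "A", "G"))
fwrite(toy, tmp, sep = "\t", na = "NA")

# Read only the first 4 data rows and two of the four columns
part <- fread(tmp, select = c("marker", "g1"), nrows = 4, na.strings = "NA")
print(dim(part))
```

Reading a subset this way only helps if the filtering can be done on a few columns or on blocks of rows at a time; whether that holds depends on the filter being applied.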

ADD REPLY • link 4.7 years ago by Ram 43k

0

Yes, I imported the file using data.table and fread, and then used the above command to filter the markers, but the R session still gets aborted.

ADD REPLY • link 4.7 years ago by evelyn ▴ 230

0

Can you please edit your question and add the exact code you're using? You might also want to check Stack Overflow for ways to improve fread's performance.

ADD REPLY • link 4.7 years ago by Ram 43k

0

I have edited the question and added a further problem with saving the text file as an RData file, which I think could be the reason.

ADD REPLY • link 4.7 years ago by evelyn ▴ 230

0

Side note: It's not good practice to use variable names that are easily confused with reserved words, as you're doing with na (NA is reserved in R).

ADD REPLY • link 4.7 years ago by Ram 43k

score 2 · Answer 1 · 2019-09-04

2

4.7 years ago

zx8754 11k

Growing objects in a loop is not a good idea. I am guessing you are running out of memory, because R keeps copying the same object over and over as it grows. Try something like the code below: load each file, filter it, collect the results in a list, then bind the list into one table:

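library(data.table)

res <- rbindlist(
  lapply(1:40, function(i) {
    # load() places the stored objects in this function's environment;
    # each file is assumed to contain one object named uni
    load(paste0("file", i, ".RData"))
    # coerce first so rbindlist() always receives a data.table
    uni <- unique(as.data.table(uni))
    # keep rows with at most 5% missing calls; logical indexing is used
    # because -which() drops every row when no row exceeds the threshold
    keep <- rowSums(is.na(uni)) <= 0.05 * ncol(uni)
    uni[keep]
  })
)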
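
One pitfall with filters written as -which(...) may be worth illustrating: when no row meets the condition, which() returns an empty vector and the negative index then selects zero rows instead of all of them. The sketch below uses a made-up 3 x 4 genotype matrix with no missing calls; the names too_many, bad, and good are chosen only for this example:

```r
# Toy genotype matrix (made up): 3 markers x 4 genotypes, no missing calls
uni <- matrix(c("A", "A", "C", "C",
                "T", "T", "T", "G",
                "G", "G", "G", "G"),
              nrow = 3, byrow = TRUE)

n_missing <- rowSums(is.na(uni))
too_many <- n_missing > 0.05 * ncol(uni)  # all FALSE for this matrix

# which(too_many) is integer(0), so the negative index selects nothing
bad <- uni[-which(too_many), , drop = FALSE]

# Logical indexing keeps every row that passes the filter
good <- uni[!too_many, , drop = FALSE]

print(nrow(bad))
print(nrow(good))
```

Logical indexing avoids the empty-index case entirely, which is why it is usually the safer way to write this kind of row filter.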

ADD COMMENT • link 4.7 years ago by zx8754 11k

0

Thank you, @zx8754! I tried using your code. It filters the markers but gives an error at the end:

Error in rbindlist(lapply(1:40, function(i) { : 
  Item 1 of input is not a data.frame, data.table or list

My input files for the loop are RData files.

ADD REPLY • link 4.7 years ago by evelyn ▴ 230

0

RData files contain R objects. zx8754 assumed that your RData files all contain data.frame objects named uni. If that's not the case, add a line to convert uni to a data.frame immediately after loading each RData file.

ADD REPLY • link 4.7 years ago by Ram 43k

0

You need to ensure (maybe with if/else) that the last line of the function returns a data.frame.

ADD REPLY • link 4.7 years ago by zx8754 11k

0

So rbindlist cannot work with the RData format? I could not find an easy way to convert RData to a data.frame.

ADD REPLY • link 4.7 years ago by evelyn ▴ 230

1

RData is a file format. data.frame is an object type. An RData file can contain multiple objects, and each object can be of one or more types. Please read up on these concepts. For the moment, use is.data.frame and as.data.frame to get your job done.
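
A minimal sketch of that check, assuming an RData file that holds a single matrix named uni. The temporary file and its contents are made up for illustration; loading into a separate environment is one way to see which objects a file actually contains:

```r
# Hypothetical example file: an RData file holding a matrix named uni
tmp <- tempfile(fileext = ".RData")
uni <- matrix(c("A", NA, "C", "C"), nrow = 2)
save(uni, file = tmp)
rm(uni)

# Load into a separate environment; load() returns the object names
e <- new.env()
loaded <- load(tmp, envir = e)
print(loaded)

# Take the first stored object and check its type
obj <- e[[loaded[1]]]
print(is.data.frame(obj))

# Coerce only when the object is not already a data.frame
if (!is.data.frame(obj)) obj <- as.data.frame(obj)
print(class(obj))
```

After the coercion, obj is a data.frame and can be passed to functions such as rbindlist that expect a data.frame, data.table, or list.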

ADD REPLY • link 4.7 years ago by Ram 43k