SNP filtering for a text file
4.7 years ago
evelyn ▴ 230

Hello,

I am trying to filter a text file for missing and redundant markers in R, but the R session gets aborted. I think this is because of the file size, since the same code works on smaller files. This file has SNP markers for 50 genotypes and is 25 GB in size. I would therefore like to do the filtering for missing and redundant SNPs on the cluster and then import the filtered file into R for further analysis, but I am not sure how to do that on the cluster.

Additionally, I think there is also a problem with saving the text file to an RData file. I appreciate your time and any help. Thank you!

snp R • 1.0k views

Have you tried this using fread and data.table? I think it might work, because data.table handles tables with millions of rows far more efficiently than data.frame. Note that fread still reads the data into memory, but its select, skip and nrows arguments let you read only some columns or rows at a time instead of the whole file.
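
For a file too large to hold in memory, one option is to filter it in pieces so that only one chunk is loaded at a time. Below is a minimal sketch of that idea using base R connections rather than fread. The helper name filter_in_chunks is hypothetical, and the file layout is an assumption (tab-delimited, one header line, marker ID in the first column, missing calls written as NA), so adjust it to your data.

```r
# Hypothetical helper: stream a delimited SNP table chunk by chunk, keep rows
# whose fraction of missing genotype calls is at most max_missing, and write
# the surviving rows to out_path. Returns the number of rows kept.
# Assumed layout: header line, first column = marker ID, missing value = NA.
filter_in_chunks <- function(in_path, out_path, chunk_size = 100000L,
                             max_missing = 0.05, sep = "\t") {
  con <- file(in_path, open = "r")
  on.exit(close(con))

  header <- readLines(con, n = 1L)      # copy the header line unchanged
  writeLines(header, out_path)

  kept <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break      # end of file reached

    chunk <- read.table(text = lines, sep = sep, header = FALSE,
                        na.strings = "NA", quote = "", comment.char = "",
                        stringsAsFactors = FALSE)

    geno <- chunk[, -1, drop = FALSE]   # genotype columns only
    keep <- rowSums(is.na(geno)) <= max_missing * ncol(geno)

    write.table(chunk[keep, , drop = FALSE], out_path, append = TRUE,
                sep = sep, quote = FALSE, row.names = FALSE,
                col.names = FALSE)
    kept <- kept + sum(keep)
  }
  kept
}
```

This only handles the missing-data filter. Duplicate markers that fall in different chunks are not detected, so removing redundant markers would still need a separate pass over the (much smaller) filtered file.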

Yes, I imported the file using data.table and fread, and then used the above command to filter the markers, but the R session still gets aborted.

Can you please edit your question and add the exact code you're using? You might also want to check Stack Overflow for ways to improve the performance of fread.

I have edited the question and added a further problem with saving the text file to an RData file, which I think could be the cause.

Side note: it's not good practice to use variable names that look like reserved words, as you're doing with na (compare NA).

4.7 years ago
zx8754 11k

Growing objects in a loop is not a good idea. I am guessing you are running out of memory, because R copies the object again and again as it grows. Try something like the code below: load each file, filter it, collect the results in a list, then bind the list once at the end:

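library(data.table)

# Assumes file1.RData ... file40.RData each contain an object named uni
res <- rbindlist(
  lapply(1:40, function(i) {
    load(paste0("file", i, ".RData"))   # loads uni into the function's environment
    uni <- unique(uni)                   # drop redundant (duplicate) markers
    n_missing <- rowSums(is.na(uni))     # number of missing calls per marker
    # Logical indexing: -which(...) would drop every row when no marker
    # exceeds the threshold, because which() then returns integer(0)
    uni[n_missing <= 0.05 * ncol(uni), ]
  })
)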
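
One caveat with negative indexing through which(): if no row exceeds the threshold, which() returns integer(0), and subsetting with -integer(0) selects zero rows instead of all of them. A small illustration on a made-up matrix with no missing values (the object names are only for this example):

```r
# Toy matrix with no missing values, so every row should be kept
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)

n_missing <- rowSums(is.na(m))   # all zeros here
threshold <- 0.05 * ncol(m)

# which() finds nothing, so this is m[integer(0), ], i.e. an empty result
via_which <- m[-which(n_missing > threshold), , drop = FALSE]

# A logical condition keeps exactly the rows that pass the filter
via_logical <- m[n_missing <= threshold, , drop = FALSE]

nrow(via_which)     # 0
nrow(via_logical)   # 3
```

Filtering with a logical condition avoids the problem and behaves the same way when some rows do exceed the threshold.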

Thank you, @zx8754! I tried using your code. It filters the markers but gives an error at the end:

Error in rbindlist(lapply(1:40, function(i) { : 
  Item 1 of input is not a data.frame, data.table or list

My input files for the loop are RData files.

RData files contain R objects. zx8754 assumed that your RData files all contain data.frame objects named uni. If that's not the case, add a line to convert uni to a data.frame immediately after loading each RData file.

You need to ensure (maybe with an if/else) that the last line of the function returns a data.frame.

So rbindlist cannot work with the RData format? I could not find an easy way to convert RData to a data.frame.

RData is a file format. data.frame is an object type. An RData file can contain multiple objects, and each object can be of one or more types. Please read up on these concepts. For the moment, use is.data.frame and as.data.frame to get your job done.
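
As a rough sketch of that check: load the file into its own environment, look up the object, and convert it only if it is not already a data.frame. The helper name load_as_data_frame is hypothetical, and the object name uni is taken from the answer above as an assumption about what your files contain.

```r
# Hypothetical helper: load one RData file and return the object called
# `name` as a data.frame, converting it (e.g. from a matrix) if necessary.
load_as_data_frame <- function(path, name = "uni") {
  env <- new.env()
  loaded <- load(path, envir = env)   # load() returns the names it restored
  if (!name %in% loaded) {
    stop("Object '", name, "' not found in ", path,
         "; file contains: ", paste(loaded, collapse = ", "))
  }
  obj <- get(name, envir = env)
  if (!is.data.frame(obj)) {
    obj <- as.data.frame(obj)
  }
  obj
}
```

Loading into a separate environment keeps the file's objects from overwriting anything in your workspace, and the error message lists what the file actually contains if uni is missing.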
