Question: SNP filtering for a text file
evelyn wrote, 20 days ago:

Hello,

I am trying to filter a text file for missing and redundant markers in R, but the R session gets aborted. I think this is because of the large file size, since the same code works on smaller files. The file has SNP markers for 50 genotypes and is 25 Gb. So I would like to do the filtering for missing and redundant SNPs on the cluster first and then import the result into R for further analysis, but I am not sure how to do that on the cluster.

Additionally, I think there may also be a problem in saving the text file to an RData file. I appreciate your time and effort for any help. Thank you!

Tags: snp, R
modified 12 days ago • written 20 days ago by evelyn

Have you tried this with fread and data.table? I think it might work: data.table is far more performant than data.frame for tables in the millions-of-rows range. If I recall correctly, fread can also read a file in chunks, so it doesn't need to hold the entire file in memory at once.
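A minimal sketch of that approach, using a tiny in-memory stand-in for the real 25 Gb file (the marker layout, rows as SNPs and columns as genotypes, is an assumption; swap the `text =` argument for your actual file path):

```r
library(data.table)

# Tiny stand-in for the real file: 5 markers x 4 genotypes, with one
# duplicated marker and one marker that is mostly missing.
snp <- fread(text = "G1,G2,G3,G4
A,A,T,T
A,A,T,T
C,NA,NA,NA
G,G,G,A
T,C,T,C", na.strings = "NA")

# Drop redundant (duplicated) markers, then markers with >5% missing calls.
snp <- unique(snp)
miss <- rowSums(is.na(snp))
snp_filtered <- snp[miss <= 0.05 * ncol(snp), ]
```

For the real file you would call `fread("yourfile.txt")` instead; `fread` also takes `select =` and `colClasses =` arguments to cut memory use further.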

modified 20 days ago • written 20 days ago by RamRS

Yes, I imported the file using data.table and fread, and then used the above command to filter markers. But the R session still gets aborted.

written 20 days ago by evelyn

Can you edit your question and add the exact code you're using, please? You might also want to check Stack Overflow for tips on improving fread performance.

written 20 days ago by RamRS

I have edited the question, and I have added an additional problem with saving the text file to an RData file, which I think could be the reason.

written 19 days ago by evelyn

Side note: it's not good practice to use variable names that clash with reserved words or built-ins, like you're doing with na.

written 19 days ago by RamRS
zx8754 (London) wrote, 19 days ago:

Growing objects in a loop is not a good idea. I am guessing you are running out of memory, as R keeps copying the same object over and over with increasing size. Try something like the below: load the data, manipulate it, collect the results in a list, then bind the list into one table:

library(data.table)

res <- rbindlist(
  lapply(1:40, function(i) {
    load(paste0("file", i, ".RData"))  # restores an object named "uni"
    uni <- unique(uni)                 # drop redundant markers
    na_count <- rowSums(is.na(uni))
    # keep markers with at most 5% missing calls; avoid -which() here,
    # since -which() on an empty match would drop every row
    uni[na_count <= 0.05 * ncol(uni), ]
  })
)
modified 18 days ago • written 19 days ago by zx8754

Thank you, @zx8754! I tried using your code. It filters the markers but gives an error at the end:

Error in rbindlist(lapply(1:40, function(i) { : 
  Item 1 of input is not a data.frame, data.table or list

My input files for the loop are RData files.

modified 18 days ago • written 18 days ago by evelyn

RData files contain saved R objects. zx8754 assumed that each of your RData files contains a data.frame object named uni. If that's not the case, add a line that converts uni to a data.frame immediately after loading each file.

written 18 days ago by RamRS

You need to ensure (maybe with an if/else) that the last expression in the function returns a data.frame.

written 18 days ago by zx8754

So rbindlist cannot work with the RData format? I could not find an easy way to convert RData to a data.frame.

written 18 days ago by evelyn

RData is a file format; data.frame is an object class. An RData file can contain multiple objects, and each object can be of any class. Please read up on these concepts. For the moment, use is.data.frame and as.data.frame to get your job done.
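To illustrate the distinction, here is a small sketch with invented object names (x, y): one RData file holding two objects of different classes, with is.data.frame/as.data.frame used to normalise them:

```r
# One RData file can hold several objects of different classes.
path <- tempfile(fileext = ".RData")
x <- 1:3                     # an integer vector
y <- data.frame(a = 4:6)     # a data.frame
save(x, y, file = path)
rm(x, y)

loaded <- load(path)         # c("x", "y"): the names, not the objects
for (nm in loaded) {
  obj <- get(nm)
  if (!is.data.frame(obj)) {
    obj <- as.data.frame(obj)   # coerce vectors/matrices for rbindlist()
  }
  stopifnot(is.data.frame(obj))
}
```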

written 18 days ago by RamRS