Question: SNP filtering for a text file
evelyn (100) wrote 11 months ago:

Hello,

I am trying to filter a text file for missing and redundant markers in R, but the R session gets aborted. I think this is because of the large file size, since the same code works on smaller files. The file contains SNP markers for 50 genotypes and is 25 GB in size. So I want to do the same filtering for missing and redundant SNPs on the cluster and then import the filtered file into R for further analysis. I am not sure how to do this on the cluster.

Additionally, I think there is a problem with saving the text file to an RData file as well. I appreciate your time and effort for any help. Thank you!

Tags: snp, R
modified 11 months ago • written 11 months ago by evelyn (100)

Have you tried this using fread from data.table? I think it might work, because data.table handles tables with millions of rows far more efficiently than data.frame does. Note that fread still returns the whole table in memory; its select, nrows and skip arguments let you read only some columns or rows at a time.
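If the full table still does not fit, one way to bound memory use is to read a fixed number of rows at a time with fread's nrows and skip arguments, filter each chunk, and keep only the rows that survive. A minimal sketch on a small temporary file; the file contents, column names, chunk size and the 5% missing-call threshold are all assumptions made for illustration:

```r
library(data.table)

# Toy stand-in for the large SNP file: 10 markers (rows) x 3 genotypes.
tmp <- tempfile(fileext = ".txt")
toy <- data.table(marker = paste0("snp", 1:10),
                  g1 = c(NA, 0:8), g2 = 0:9, g3 = 9:0)
fwrite(toy, tmp, sep = "\t", na = "NA")

# Read only the header so every chunk can be given the same column names.
cols <- names(fread(tmp, nrows = 0))

chunk_size <- 4        # hypothetical; use a much larger value for real data
n_rows <- nrow(toy)    # for a real file, count the lines outside R first

chunks <- lapply(seq(0, n_rows - 1, by = chunk_size), function(start) {
  # Skip the header line plus the data rows that were already processed.
  chunk <- fread(tmp, skip = 1 + start, nrows = chunk_size,
                 header = FALSE, col.names = cols)
  geno <- chunk[, cols[-1], with = FALSE]       # genotype columns only
  keep <- rowSums(is.na(geno)) <= 0.05 * ncol(geno)
  chunk[keep, ]
})

# Bind the filtered chunks once at the end.
filtered <- rbindlist(chunks)
```

Only one chunk plus the already filtered rows are held in memory at any time, at the cost of re-scanning the file for each chunk.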

modified 11 months ago • written 11 months ago by RamRS (28k)

Yes, I imported the file using data.table and fread, and then used the above command to filter the markers. But the R session still gets aborted.

written 11 months ago by evelyn (100)

Can you edit your question and add the exact code you're using, please? You might also want to check Stack Overflow for ways to improve the performance of fread.

written 11 months ago by RamRS (28k)

I have edited the question and added a further problem with saving the text file to an RData file, which I think could be the cause.

written 11 months ago by evelyn (100)

Side note: It's not good practice to use keywords as variable names, like you're doing with na.

written 11 months ago by RamRS (28k)

zx8754 (9.4k), London, wrote 11 months ago:

Growing objects in a loop is not a good idea. I am guessing you are running out of memory, as R keeps copying the same object over and over again as it grows. Try something like the code below: load the data, filter it, collect the results in a list, then bind the list once at the end:

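library(data.table)

# Filter each file on its own, collect the results in a list, then bind once.
res <- rbindlist(
  lapply(1:40, function(i){
    # Assumes every file contains a data.frame named uni.
    load(paste0("file", i, ".RData"))
    uni <- unique(uni)
    # Count missing calls per marker (row).
    n_missing <- rowSums(is.na(uni))
    # Keep markers with at most 5% missing calls; a logical index is safe
    # even when no marker exceeds the threshold.
    uni[n_missing <= 0.05 * ncol(uni), ]
    })
  )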
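
A small in-memory illustration of the same filter, using a made-up matrix in place of a loaded uni object. It also shows why a logical index is the safer way to drop rows: when no row exceeds the threshold, which() returns an empty vector, and a negative empty index selects zero rows instead of all of them.

```r
# Made-up example: 4 markers (rows) x 40 genotypes (columns), no missing calls.
uni <- matrix(rep(0:3, times = 40), nrow = 4,
              dimnames = list(paste0("snp", 1:4), NULL))

n_missing <- rowSums(is.na(uni))   # all zeros here
threshold <- 0.05 * ncol(uni)      # 5% of 40 columns = 2

# Negative which() index: nothing exceeds the threshold, so every row is lost.
by_which <- uni[-which(n_missing > threshold), , drop = FALSE]

# Logical index: every row is kept, as intended.
by_logical <- uni[n_missing <= threshold, , drop = FALSE]

nrow(by_which)     # 0
nrow(by_logical)   # 4

# Once a marker does exceed the threshold, both forms agree.
uni[2, 1:10] <- NA
n_missing <- rowSums(is.na(uni))
agree <- identical(uni[-which(n_missing > threshold), , drop = FALSE],
                   uni[n_missing <= threshold, , drop = FALSE])
```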
modified 11 months ago • written 11 months ago by zx8754 (9.4k)

Thank you, @zx8754! I tried using your code. It filters the markers but gives an error at the end:

Error in rbindlist(lapply(1:40, function(i) { : 
  Item 1 of input is not a data.frame, data.table or list

My input files for the loop are RData files.

modified 11 months ago • written 11 months ago by evelyn (100)

RData files contain R objects. zx8754 assumed that your RData files all contain data.frame objects named uni. If that's not the case, add a line to convert uni to a data.frame immediately after loading each RData file.

written 11 months ago by RamRS (28k)

You need to ensure (maybe using if/else) that the last line of the function returns a data.frame.
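
One way to follow this advice is a small helper that checks the type before returning. This is only a sketch: to_df is a hypothetical name, and the matrix below is a made-up stand-in for an object loaded from an RData file.

```r
library(data.table)

# Hypothetical helper: pass data.frames through, convert anything else.
to_df <- function(x) {
  if (is.data.frame(x)) x else as.data.frame(x)
}

parts <- lapply(1:3, function(i) {
  uni <- matrix(i, nrow = 2, ncol = 3)   # stand-in for load(...)
  to_df(uni)                             # last expression is the return value
})

# Every list element is now a data.frame, so rbindlist accepts it.
res <- rbindlist(parts)
```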

written 11 months ago by zx8754 (9.4k)

So rbindlist cannot work with the RData format? I could not find an easy way to convert RData to a data.frame.

written 11 months ago by evelyn (100)

RData is a file format. data.frame is an object type. An RData file can contain multiple objects, and each object can be of one or more types. Please read up on these concepts. For the moment, use is.data.frame and as.data.frame to get your job done.
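
To make the distinction concrete, here is a minimal sketch that saves two made-up objects into one RData file, loads them into a separate environment, and inspects their types before converting. The object names and contents are assumptions for illustration only.

```r
# Save two made-up objects of different types into a single RData file.
uni <- matrix(c(0, 1, NA, 2, 1, 0), nrow = 2)
note <- "toy example"
rdata <- tempfile(fileext = ".RData")
save(uni, note, file = rdata)

# Load into a separate environment so nothing in the workspace is overwritten.
env <- new.env()
loaded <- load(rdata, envir = env)   # load() returns the names it restored

# Inspect what the file actually contained.
classes <- sapply(loaded, function(nm) class(get(nm, envir = env))[1])

# The matrix is not a data.frame, so convert it explicitly.
was_df <- is.data.frame(env$uni)
uni_df <- as.data.frame(env$uni)
```

Here loaded holds the object names stored in the file, and classes shows that neither object is a data.frame until uni is converted.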

written 11 months ago by RamRS (28k)