Question: R error: vector memory exhausted
evelyn wrote, 11 days ago:

I split a large VCF file with multiple samples into small vcf.gz files using a loop, and I am now trying to convert these small vcf.gz files to RData files in another loop, but I get a "vector memory exhausted" error. The original file is 35 GB; the same code works fine for an original file of 28 GB. I am using R version 3.6.0 on a Mac with 16 GB RAM.

# total.lines is the number of lines in the original file (determined beforehand);
# split the file into 50 consecutive chunks of roughly equal size
sequence <- round(seq(1, total.lines, length.out = 51))
for(i in 2:51){
  print(i)
  system(paste0("sed -n '", sequence[i-1], ",", sequence[i], "'p input.vcf.gz > input", i-1, ".vcf.gz"))
}

library(data.table)
for(i in 1:50){
  print(i)
  df <- fread(paste0("input", i, ".vcf.gz"))
  # take the column names from the first chunk and reuse them for all later chunks
  if(i == 1)
    COLs <- colnames(df)
  colnames(df) <- COLs
  save(df, file = paste0("input", i, ".RData"))
}

To solve the memory issue, I have tried raising R_MAX_VSIZE in ~/.Renviron:

cd ~ 
touch .Renviron 
open .Renviron 
R_MAX_VSIZE=100Gb
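
(Whether the new limit is actually picked up only shows in a freshly restarted R session, since .Renviron is read at startup; it can be checked with, for example:)

Sys.getenv("R_MAX_VSIZE")   # should echo "100Gb" if the .Renviron entry was saved and read at startup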

But it did not solve the problem and R still aborts with a fatal error. I want to make the RData files from the vcf.gz chunks of the 35 GB file for further analysis. Thank you for your help!

modified 10 days ago by Friederike • written 11 days ago by evelyn

What is the final goal, i.e. how is the output supposed to look? I am 99.9999% sure there is a simple command-line/bcftools way of doing this outside of R.

written 11 days ago by ATpoint

Once RData files are made, I filter those for missing and redundant markers using:

library(data.table)
all <- NULL
for(i in 1:50){
  load(paste0("input", i, ".RData"))           # loads one chunk into df
  df <- unique(df)                             # drop redundant (duplicated) markers
  count.nas <- rowSums(is.na(df))              # number of missing calls per marker
  df <- df[count.nas <= (0.10 * ncol(df)), ]   # keep markers with at most 10% missing
  all <- rbind(all, df)
}
write.table(all, file = "output.txt", sep = "\t")

Then I use this text file for making DAPC graphs in R. So my final goal is a text file, filtered for missing and redundant markers, that can be used in R for DAPC. I would also like to try various threshold values for the missing-data filter to choose the best one. Thank you!

written 11 days ago by evelyn

Please show a representative data example that indicates how the output should look. You should not expect people to dig into your code to figure out how things might look. Simply make a short representative input and output and paste them here.

written 11 days ago by ATpoint

I apologize for any confusion. I am using an input text file (a merged VCF from multiple samples). For illustration, I kept four samples here. A few lines from the original input file, which is 35 GB in size, look like:

Sample_1    Sample_2    Sample_3    Sample_4
NA  NA  NA  NA
A   A   T   G
T   T   T   T
T   T   T   T
R   R   R   R
Y   Y   Y   Y
NA  NA  NA  NA
S   S   S   S

So I want to filter out NA, which represents missing data, with a threshold such as 5% missing data allowed, and also remove redundant markers, e.g. keep only one copy of T T T T and drop the duplicates. These lines in the final file would then look like:

Sample_1    Sample_2    Sample_3    Sample_4
A   A   T   G
T   T   T   T
R   R   R   R
Y   Y   Y   Y
S   S   S   S
written 11 days ago by evelyn

Assuming your data is read into an object called dat, will the following code work? (Replace 0.05 with the threshold you want.)

filter <- apply(dat, 1, function(x) sum(is.na(x)) / length(x) >= 0.05)  # TRUE for rows with too many NAs
result <- unique(dat[!filter, ])                                        # drop those rows, then drop duplicates
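
For example, on the small four-sample table above (here typed in by hand as a data.frame called dat, just to illustrate the idea):

# toy version of the example table pasted above
dat <- data.frame(Sample_1 = c(NA, "A", "T", "T", "R", "Y", NA, "S"),
                  Sample_2 = c(NA, "A", "T", "T", "R", "Y", NA, "S"),
                  Sample_3 = c(NA, "T", "T", "T", "R", "Y", NA, "S"),
                  Sample_4 = c(NA, "G", "T", "T", "R", "Y", NA, "S"),
                  stringsAsFactors = FALSE)
filter <- apply(dat, 1, function(x) sum(is.na(x)) / length(x) >= 0.05)
result <- unique(dat[!filter, ])
result   # should be the five unique rows without any NA, i.e. the desired output above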
modified 11 days ago • written 11 days ago by Sam

Thank you, @Sam! The first issue I had with my data file was its large size: R freezes or aborts if I try to import the file directly. That's why I split it into smaller parts and combine them later, after filtering.

written 10 days ago by evelyn

You can, in theory, run the above snippet on each small part of the file, then combine the filtered data and do a unique again at the very end.
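
A rough sketch of that, reusing the per-chunk RData files and the filter from above (the file names and the 0.05 cut-off are just placeholders):

library(data.table)

threshold <- 0.05                # maximum fraction of missing calls allowed per marker
tmp <- "filtered_chunks.txt"     # placeholder name for the intermediate file

for (i in 1:50) {
  load(paste0("input", i, ".RData"))                 # loads one chunk into df
  filter <- apply(df, 1, function(x) sum(is.na(x)) / length(x) >= threshold)
  keep <- unique(df[!filter, ])
  # append each filtered chunk to disk instead of growing one big object with rbind
  fwrite(keep, tmp, sep = "\t", append = (i > 1), col.names = (i == 1))
}

# the filtered file should be far smaller, so a final pass to drop rows
# duplicated across chunks can hopefully be done in memory
final <- unique(fread(tmp))
fwrite(final, "output.txt", sep = "\t")

The per-chunk unique() keeps the intermediate file small; the final unique() catches markers duplicated across chunk boundaries.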

written 10 days ago by Sam
Friederike (United States) wrote, 10 days ago:

I wouldn't use R for the filtering because the default working philosophy of R is to load everything into memory before doing anything to a given object.

I'd probably try to filter before reading into R, e.g. using standard command line tools:

$ cat test.txt
Sample_1    Sample_2    Sample_3    Sample_4
A  NA  NA  NA
A   A   T   G
T   T   T   T
T   T   T   T
R   R   R   R
Y   Y   Y   Y
NA  NA  B B
S   S   S   S

$ awk '{c=0;for(i=1;i<=NF;i++){ if($i ~ /NA/)c++; } OFS = "\t";print c/NF, $0}' test.txt | awk '$1 < 0.75' | cut -f 2-
Sample_1    Sample_2    Sample_3    Sample_4
A   A   T   G
T   T   T   T
T   T   T   T
R   R   R   R
Y   Y   Y   Y
NA  NA  B B
S   S   S   S

In this case, I've removed the line with 3/4 NAs ($1 < 0.75), but you can change that value, of course.

EDIT: the first awk bit counts the number of NAs per line and stores that value in the variable c, which, mostly for the sake of clarity, is added as an extra column; the second awk then filters on that column.

modified 10 days ago • written 10 days ago by Friederike

In addition, you may want to check out the vcftools suite; it may already provide a function that does exactly what you want.

written 10 days ago by Friederike

Thank you, @Friederike! I have tried your awk statement and it filters out the NAs. But I am confused about how it filters:

'$1 < 0.95'  resulted in 35736361 lines
'$1 < 0.90'  resulted in 17351161 lines
'$1 < 0.80'  resulted in  7030538 lines

However, I expected '$1 < 0.95' to give the fewest lines of these thresholds, as it should allow only 5% NAs. I want to keep lines with at most 5% missing data and filter out the rest, but I am not sure why the pattern is reversed.

modified 10 days ago • written 10 days ago by evelyn

Don't you want #NA / all <= 0.05 then?

written 10 days ago by Friederike

Oh yes, it worked. Thank you so much!

written 9 days ago by evelyn