R error: vector memory exhausted
4.6 years ago
evelyn ▴ 230

I split a large VCF file with multiple samples into small vcf.gz files using a loop. I am then trying to convert these small vcf.gz files to RData files in a loop, but I get a "vector memory exhausted" error. The original file is 35 GB in size; the code works fine for an original file of 28 GB. I am using R version 3.6.0 on a Mac with 16 GB RAM.

# total.lines = number of lines in the original file, counted beforehand
sequence <- round(seq(1, total.lines, length.out = 51))
for(i in 2:51){
  print(i)
  system(paste0("sed -n '",sequence[i-1],",",sequence[i],"'p input.vcf.gz > input",i-1,".vcf.gz"))
}

library(data.table)
for(i in 1:50){
  print(i)
  df <- fread(paste0("input", i, ".vcf.gz"))
  if(i == 1)
    COLs <- colnames(df)              # only the first chunk contains the header row
  colnames(df) <- COLs
  save(df, file = paste0("input", i, ".RData"))
}

To solve the memory issue, I have tried setting R_MAX_VSIZE in my .Renviron:

cd ~ 
touch .Renviron 
open .Renviron 
R_MAX_VSIZE=100Gb

But it did not solve the problem, and R aborts with a fatal error. I want to make RData files from the vcf.gz chunks of the 35 GB file for further analysis. Thank you for your help!


What is the final goal, i.e. how is the output supposed to look? I am 99.9999% sure there is a simple command-line / bcftools way of doing this outside of R.
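For instance, assuming the input is a proper (bgzipped) VCF with genotypes in the GT field, something along these lines should already cover the missingness part (a sketch; F_MISSING is the per-site fraction of missing genotypes in recent bcftools versions):

# sketch: drop sites where more than 5% of genotypes are missing
bcftools view -e 'F_MISSING > 0.05' -Oz -o filtered.vcf.gz input.vcf.gz

The duplicate-marker part would still need a separate step.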


Once the RData files are made, I filter them for missing and redundant markers using:

library(data.table)
all <- NULL
for(i in 1:50){
  load(paste0("input",i,".RData"))
  df <- unique(df)                                   # drop duplicate rows within the chunk
  count.nas <- rowSums(is.na(df))                    # number of NAs per row
  df <- df[-which(count.nas > (0.10 * ncol(df))), ]  # keep rows with at most 10% missing
  all <- rbind(all, df)
}
write.table(all, file = "output.txt", sep="\t")

Then I use this text file for making DAPC graphs in R. So my final goal here is a text file, filtered for missing and redundant markers, that can be used in R for DAPC. I would also like to try various threshold values for filtering out missing marker information, to choose the best one. Thank you!
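For example, to compare thresholds I have in mind something along these lines (just a sketch; here all would be the combined table before any NA filtering):

# count how many markers survive each missingness cutoff
na.frac <- rowSums(is.na(all)) / ncol(all)
for(thr in c(0.05, 0.10, 0.20)){
  cat("threshold", thr, ":", sum(na.frac <= thr), "markers kept\n")
}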


Please show a representative data example that indicates how the output should look. You should not expect people to dig into your code to figure out how things might look. Simply make a short, representative input and output and paste them here.


I apologize for any confusion. I am using an input text file (a merged VCF from multiple samples). For illustration, I kept four samples here. A few lines from the original 35 GB input file look like:

Sample_1    Sample_2    Sample_3    Sample_4
NA  NA  NA  NA
A   A   T   G
T   T   T   T
T   T   T   T
R   R   R   R
Y   Y   Y   Y
NA  NA  NA  NA
S   S   S   S

So I want to filter out NA, which represents missing data, using a threshold such as 5% missing data allowed, and also remove redundant markers, e.g. keep only one copy of T T T T and drop the duplicates. The corresponding lines of the final file would look like:

Sample_1    Sample_2    Sample_3    Sample_4
A   A   T   G
T   T   T   T
R   R   R   R
Y   Y   Y   Y
S   S   S   S

Assuming your data file is called dat, would the following code work? (Replace 0.05 with the threshold you want.)

filter <- apply(dat, 1, function(x) sum(is.na(x)) / length(x) >= 0.05)  # TRUE for rows with at least 5% NAs
result <- unique(dat[!filter, ])                                        # drop those rows, then remove duplicates

Thank you, @Sam! The first issue I had with my data file was its large size: R freezes or aborts if I try to import the file directly. That's why I split it into smaller parts and combine them later after filtering.


You can, in theory, run the above script on each small part of the file. Then you can combine the data and run unique() again at the very end.
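Roughly like this (a sketch, reusing your 50-chunk loop and a 5% cutoff; adjust the threshold as needed):

library(data.table)
all <- NULL
for(i in 1:50){
  load(paste0("input", i, ".RData"))               # restores the chunk as 'df'
  keep <- rowSums(is.na(df)) / ncol(df) <= 0.05    # per-row fraction of NAs
  all <- rbind(all, unique(df[keep, ]))            # filter, then de-duplicate within the chunk
}
all <- unique(all)                                 # final de-duplication across chunks
write.table(all, file = "output.txt", sep = "\t", row.names = FALSE, quote = FALSE)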


I wouldn't use R for the filtering, because R's default working philosophy is to load everything into memory before doing anything with a given object.

I'd probably try to filter before reading into R, e.g. using standard command line tools:

$ cat test.txt
Sample_1    Sample_2    Sample_3    Sample_4
A  NA  NA  NA
A   A   T   G
T   T   T   T
T   T   T   T
R   R   R   R
Y   Y   Y   Y
NA  NA  B B
S   S   S   S

$ awk '{c=0;for(i=1;i<=NF;i++){ if($i ~ /NA/)c++; } OFS = "\t";print c/NF, $0}' test.txt | awk '$1 < 0.75' | cut -f 2-
Sample_1    Sample_2    Sample_3    Sample_4
A   A   T   G
T   T   T   T
T   T   T   T
R   R   R   R
Y   Y   Y   Y
NA  NA  B B
S   S   S   S

In this case, I've removed the line with 3/4 NAs ('$1 < 0.75'), but you can change that value, of course.

EDIT: the first awk bit counts the number of NAs per line and stores that value in the variable c, which, mostly for the sake of clarity, I add as an extra column on which I then subsequently filter.


In addition, you may want to check out the vcftools suite; it may already provide a function that does exactly what you want.
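For example, the --max-missing option covers the missingness part (a sketch, assuming the input is a proper VCF; 0.95 means at least 95% of genotypes must be present at a site):

vcftools --gzvcf input.vcf.gz --max-missing 0.95 --recode --recode-INFO-all --out filtered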


Thank you, @Friederike! I have tried your awk statement and it filters out the NAs, but I am confused about how it filters:

'$1 < 0.95'  resulted in 35736361 lines
'$1 < 0.90'  resulted in 17351161 lines
'$1 < 0.80'  resulted in  7030538 lines

However, '$1 < 0.95' should result in the fewest lines among these thresholds, since it allows only 5% NAs. I want to keep at most 5% missing data per line and filter out the rest, but I am not sure why it shows the reverse pattern.


Don't you want #NA / all <= 0.05 then?
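That is, keeping the same first awk step, the filter would become something like:

awk '{c=0;for(i=1;i<=NF;i++){ if($i ~ /NA/)c++; } OFS = "\t";print c/NF, $0}' test.txt | awk '$1 <= 0.05' | cut -f 2-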


Oh yes, it worked. Thank you so much!
