I am trying to create a support vector regression model that assigns a score to a DNA sequence. I am using the kebabs package from Bioconductor (https://bioconductor.org/packages/release/bioc/html/kebabs.html). Here is my code:
    library(kebabs)
    library(Biostrings)

    fastas = readDNAStringSet('train.fa')
    scores = read.csv('train_scores.csv', sep = '\t', header = FALSE)

    specKlin = spectrumKernel(k = 5:7, distWeight = linWeight(sigma = 72))
    specKexp = spectrumKernel(k = 5:7, distWeight = expWeight(sigma = 72))
    allspecK = c(specKlin, specKexp)
    nus = c(.5, .6, .7, .8)

    model <- kbsvm(x = fastas, y = scores, kernel = allspecK,
                   pkg = 'e1071', svm = 'nu-svr', nu = nus,
                   showProgress = TRUE)
However, I get this output:
    Grid Search Progress:
    Kernel_1 Error: cannot allocate vector of size 1557.2 Gb
    In addition: Warning message:
    grid search without cross validation (cross=0)
My FASTA file has more than 400,000 DNA sequences, and the code only works when I use about 10,000 or fewer of them. Changing parameters such as k, or training a single model instead of grid searching, results in the same error when I use the full file. Is there any way to work around this error?
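For context on the numbers above, here is a rough back-of-the-envelope check. It assumes the failing allocation is a dense n-by-n kernel matrix of 8-byte doubles and that R reports "Gb" in units of 2^30 bytes; neither has been confirmed against the kebabs internals. The helper name `kernel_matrix_gib` is made up for illustration.

```r
# Estimated memory (in GiB) for a dense n x n matrix of doubles.
# Assumption: one 8-byte value per pair of sequences.
kernel_matrix_gib <- function(n, bytes_per_value = 8) {
  n^2 * bytes_per_value / 2^30
}

small <- kernel_matrix_gib(10000)    # subset size that still works
large <- kernel_matrix_gib(400000)   # lower bound on the full data set

# Number of sequences implied by the 1557.2 Gb in the error message,
# under the same assumption.
implied_n <- sqrt(1557.2 * 2^30 / 8)

print(c(small = small, large = large, implied_n = implied_n))
```

Under these assumptions, 10,000 sequences need under 1 GiB, while 400,000 need more than 1,000 GiB. The implied sequence count of roughly 457,000 fits the "more than 400,000" figure. Because the cost grows with the square of the number of sequences, changing k or dropping the grid search would not be expected to shrink it.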