I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the pairwiseAlignment function from the Biostrings package. This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, : cannot allocate memory block of size 17179869183.7 Gb
This is appearing to be almost certainly a bug--people on stack overflow are saying it is likely some sort of 64-bit integer-overflow. (something I am not really familiar with).
Does anybody have suggestions for a way around this bug or a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.
Here is a code snippet that replicates the problem:
library("Biostrings") scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance genome = read.DNAStringSet(genome_file_name)[] #genome is a "DNAString" instance #qstart, qend, substart, subend are all from intial BLAST alignment step scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub) #that last line gives the error: #Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, : #cannot allocate memory block of size 17179869182.9 Gb
And here is my sessioninfo() from R if anybody is interested:
> sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale:  C/en_US.UTF-8/C/C/C/C attached base packages:  stats graphics grDevices utils datasets methods base other attached packages:  RPostgreSQL_0.3-2 DBI_0.2-5 Biostrings_2.24.1 IRanges_1.14.4 BiocInstaller_1.4.7 Biobase_2.16.0  BiocGenerics_0.2.0 loaded via a namespace (and not attached):  stats4_2.15.1 tcltk_2.15.1 tools_2.15.1
I appreciate your help.