I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the pairwiseAlignment function from the Biostrings package. This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, : cannot allocate memory block of size 17179869183.7 Gb
This is appearing to be almost certainly a bug--people on stack overflow are saying it is likely some sort of 64-bit integer-overflow. (something I am not really familiar with).
Does anybody have suggestions for a way around this bug or a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.
Here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
And here is my sessioninfo() from R if anybody is interested:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RPostgreSQL_0.3-2 DBI_0.2-5 Biostrings_2.24.1 IRanges_1.14.4 BiocInstaller_1.4.7 Biobase_2.16.0
[7] BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] stats4_2.15.1 tcltk_2.15.1 tools_2.15.1
I appreciate your help.
But 56K * 56K = 3.1Gb; even with overhead required to run this inside a scripting language, it doesn't make sense that 17,179,869,182.9 Gb is being asked for -- I don't understand why it would ever take 9 orders of magnitude more memory to carry this out.
It's a little more complicated than that, 3.1Gb * (bits to store each cell), the 17,179,869,182.9 Gb is because an error in estimate the total size, it's not your real size needed.