Question

Bug: Longer Pairwise Sequence Alignment In R

1

Entering edit mode

11.6 years ago

threesam ▴ 10

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the pairwiseAlignment function from the Biostrings package. This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:

Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, : cannot allocate memory block of size 17179869183.7 Gb

This is appearing to be almost certainly a bug--people on stack overflow are saying it is likely some sort of 64-bit integer-overflow. (something I am not really familiar with).

Does anybody have suggestions for a way around this bug or a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.

Here is a code snippet that replicates the problem:

library("Biostrings")

scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance

genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance

#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance

curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error: 
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  : 
#cannot allocate memory block of size 17179869182.9 Gb

And here is my sessioninfo() from R if anybody is interested:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RPostgreSQL_0.3-2   DBI_0.2-5           Biostrings_2.24.1   IRanges_1.14.4      BiocInstaller_1.4.7 Biobase_2.16.0     
[7] BiocGenerics_0.2.0 

loaded via a namespace (and not attached):
[1] stats4_2.15.1 tcltk_2.15.1  tools_2.15.1

I appreciate your help.

bioconductor r • 3.4k views

ADD COMMENT • link updated 11.6 years ago by JC 13k • written 11.6 years ago by threesam ▴ 10

score 0 · Answer 1 · 2012-09-12

0

Entering edit mode

11.6 years ago

JC 13k

Your error is because the sequences are too long, to perform a pairwise alignment of 2 sequences (A, B), you need a matrix of (n x m), n = length(A), m = length(B), which can overflow your memory. You can try:

a) Blast already perform the alignment, if you don't trust the indels, you can adjust the gap parameters.

b) Or you can try other aligner like Muscle.

c) Reduce the sequence size in your scaffold, taking (with some overlap) 1kb windows and doing the alignments.

ADD COMMENT • link 11.6 years ago by JC 13k

0

Entering edit mode

But 56K * 56K = 3.1Gb; even with overhead required to run this inside a scripting language, it doesn't make sense that 17,179,869,182.9 Gb is being asked for -- I don't understand why it would ever take 9 orders of magnitude more memory to carry this out.

ADD REPLY • link 11.6 years ago by threesam ▴ 10

0

Entering edit mode

It's a little more complicated than that, 3.1Gb * (bits to store each cell), the 17,179,869,182.9 Gb is because an error in estimate the total size, it's not your real size needed.

ADD REPLY • link 11.6 years ago by JC 13k