Question: Bug: Longer Pairwise Sequence Alignment In R
1
gravatar for threesam
6.7 years ago by
threesam10
threesam10 wrote:

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the pairwiseAlignment function from the Biostrings package. This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:

Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, : cannot allocate memory block of size 17179869183.7 Gb

This is appearing to be almost certainly a bug--people on stack overflow are saying it is likely some sort of 64-bit integer-overflow. (something I am not really familiar with).

Does anybody have suggestions for a way around this bug or a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.

Here is a code snippet that replicates the problem:

library("Biostrings")

scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance

genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance

#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance

curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error: 
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  : 
#cannot allocate memory block of size 17179869182.9 Gb

And here is my sessioninfo() from R if anybody is interested:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RPostgreSQL_0.3-2   DBI_0.2-5           Biostrings_2.24.1   IRanges_1.14.4      BiocInstaller_1.4.7 Biobase_2.16.0     
[7] BiocGenerics_0.2.0 

loaded via a namespace (and not attached):
[1] stats4_2.15.1 tcltk_2.15.1  tools_2.15.1

I appreciate your help.

R bioconductor • 2.1k views
ADD COMMENTlink written 6.7 years ago by threesam10
0
gravatar for JC
6.7 years ago by
JC7.8k
Mexico
JC7.8k wrote:

Your error is because the sequences are too long, to perform a pairwise alignment of 2 sequences (A, B), you need a matrix of (n x m), n = length(A), m = length(B), which can overflow your memory. You can try:

a) Blast already perform the alignment, if you don't trust the indels, you can adjust the gap parameters.

b) Or you can try other aligner like Muscle.

c) Reduce the sequence size in your scaffold, taking (with some overlap) 1kb windows and doing the alignments.

ADD COMMENTlink written 6.7 years ago by JC7.8k

But 56K * 56K = 3.1Gb; even with overhead required to run this inside a scripting language, it doesn't make sense that 17,179,869,182.9 Gb is being asked for -- I don't understand why it would ever take 9 orders of magnitude more memory to carry this out.

ADD REPLYlink written 6.7 years ago by threesam10

It's a little more complicated than that, 3.1Gb * (bits to store each cell), the 17,179,869,182.9 Gb is because an error in estimate the total size, it's not your real size needed.

ADD REPLYlink written 6.7 years ago by JC7.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 888 users visited in the last hour