Compare Individual Nucleotides From Two Sequences
2
3
Entering edit mode
10.4 years ago
Joris Meys ▴ 130

I am looking at a simple, fast and vectorized way to compare two Bioconductor Biostring objects (DNAstrings to be exact) and check for each position whether the nucleotides are the same. Currently, I use following code, but I doubt (I hope...) this is the most convenient way:

compare.DNA <- function(x,y){
sapply(seq(length(x)),
function(i){
x[i] == y[i]
}
)
}


This gives:

require(Biostring)
DNAseq1 <- DNAString('ACCTCAGCT')
DNAseq2 <- DNAString('ATCTCATCT')
compare.DNA(DNAseq1,DNAseq2)
[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE


I've been trying to access the values directly, but there's no slot in Biostrings that contain the actual information, it just contains a link to a cached object. The problem is I have to do this on +3 million sequences, and this code runs slow:

> system.time(replicate(10000,compare.DNA(DNAseq1,DNAseq2)))
user  system elapsed
51.05    0.16   51.21


Any ideas on how to do this a whole lot faster?

r bioconductor sequence comparison • 6.8k views
3
Entering edit mode
10.4 years ago
Joris Meys ▴ 130

Michael Dondrup led me to investigate the conversions of DNAstring to basic R objects a bit further, and looking at the internal methods I figured out that converting to integer would be the fastest (apparently, that's how sequences are stored in the allocated memory).

Hence :

compare.DNA2 <- function(x,y){
as.integer(x) == as.integer(y)
}


is the fastest solution I could find. Compared to my other solution and the code of Michael :

require(rbenchmark)
benchmark(
compare.DNA(DNAseq1,DNAseq2),
compare.DNA2(DNAseq1,DNAseq2),
comp.string.fast(DNAseq1,DNAseq2),
columns=c('test','replications','elapsed','relative'),
replications=10000,
order = 'relative'
)


Gives :

                                test replications elapsed  relative
2     compare.DNA2(DNAseq1, DNAseq2)        10000    0.61  1.000000
3 comp.string.fast(DNAseq1, DNAseq2)        10000    0.95  1.557377
1      compare.DNA(DNAseq1, DNAseq2)        10000   51.13 83.819672


As Michael Dondrup rightfully pointed out, using DNAStringSet objects makes more sense. as.integer() doesn't work on those, so see his answer for some other options.

0
Entering edit mode

This is true in the artificial case that the sequences are in single DNAString objects, but you cannot convert a DNAStringSet which you will likely have to integer using as.integer.

0
Entering edit mode

@Michael Dondrup. I know. That's why I specifically asked for DNAstring methods, as I don't have a DNAStringSet.

0
Entering edit mode

ok, was just wondering how you read your sequences into R, or do you generate them in R? Because in most cases I would expect read.DNAStringSet be used to read a fasta file.

0
Entering edit mode

@Michael Dondrup I get them from another lab, so I have a (pretty odd) list of DNAStrings. Converting that to a DNAStringSet isn't that obvious.

0
Entering edit mode

as.raw should be faster still; the folks on the Bioconductor mailing list http://bit.ly/x1U1ux might have other suggestions (especially if the 'real' data has some particular characteristics, e.g., a few long or many short sequences).

0
Entering edit mode

@MartinMorgan I tested it, and that's actually not the case. as.raw first calls as.integer to get the data out...

0
Entering edit mode

@MartinMorgan I tested it, and that's actually not the case. as.raw first calls as.integer to get the data out... Thanks for the tip about the mailing list though.

2
Entering edit mode
10.4 years ago

Something like this?

>  system.time(replicate(1000000, comp.string.fast(DNAseq1,DNAseq2)))
user  system elapsed
150.365   1.007 151.681


And here is the function:

comp.string.fast =  strsplit(as.character(DNAseq1),"")[[1]] == strsplit(as.character(DNAseq2),"")[[1]]


Or even better on 'real' data:

# generate data as DNAStringSet, that's what you are most likely going to have:
str1 = rep(DNAStringSet(DNAseq1),1e4)
str2 = rep(DNAStringSet(DNAseq2),1e4)
comp.strings.even.faster <- function (str1, str2) {
tmp = cbind (strsplit(as.character(str1), ""), strsplit(as.character(str2), ""))
t(apply (tmp, 1, function(x){x[[1]]==x[[2]]}))
}
system.time(comp.strings.even.faster(str1,str2))
user  system elapsed
0.131   0.005   0.136

system.time(replicate(10000,compare.DNA2(DNAseq1,DNAseq2)))
user  system elapsed
0.719   0.001   0.721


Which shows two things:

1. I need a new computer,my laptop is so slow
2. if the data is in the right format, vectorized functions in R beat repeated function call, of course given you have a DNAStringSet (which is not the case for you, but anyway)

Edit: fastest that far, and most simple:

comp.very.fast <- function(x,y){ as.matrix(x) == as.matrix(y) }

system.time(comp.very.fast(str1, str2))
user  system elapsed
0.027   0.001   0.029

0
Entering edit mode

You give the result instead of the function in the code, but that's a detail. Thx for the info. I didn't try that one as I believed the call to strsplit to slow down everything, but it doesn't. It led me to the fastest solution (see my extra answer).

0
Entering edit mode

fixed..........

0
Entering edit mode

I need to look a bit deeper into the DNAStringSet, as that allows very interesting code. On a sidenote, you might want to add following function to the test : comp.very.fast <- function(x,y){ as.matrix(x) == as.matrix(y) }. Does the exactly the same as your comp.strings.even.faster, but outperforms it by a factor 3. Just add it to the answer, I'll give you the rep ;)

0
Entering edit mode

Wow, who could expect that as.matrix works on DNAStringSet, but it does. Amazing