Looking For A Light-Weight Command-Line Tool To Compute Pairwise Sequence Distances
Entering edit mode
11.0 years ago
Mike ▴ 80

I've been asked to analyze a very large dataset of DNA barcodes (short 600-ish bp COI gene sequences) to compare INTERspecific vs. INTRAspecific sequence distances. I've had some luck using fdnadist (the EMBOSS version of Phylip's dnadist) and the APE package in R to generate distance matrices, and then parsing the output. Unfortunately, as the number of sequences grows, the generation of distance matrices becomes memory-prohibitive. I'm looking for a way to loop through the set of sequences and compute distance measures on the fly, and store these numbers either in a database or text file, to avoid the memory problem.

Is there a good tool that can generate this distance measure between 2 DNA sequences?

sequence-analysis barcode • 2.7k views
Entering edit mode
11.0 years ago
David W 4.9k

Hi Mike,

There will definitely be faster ways to do this, but here's an approach I've done for similar problems (modified to fit your requirements):

write_pw <- function(seqs, fname, subs_model="p", print_every=1000){
    indices <- combn(dim(seqs)[1], 2)
    out_file = file(fname, "w")
    writeLines("sp1, sp2, dist", con=out_file, sep="\n")

    per_line <- function(i){
        d <- dist.dna(seqs[i,], model=subs_model)
        writeLines( paste( c(labels(d), d), collapse=", "), con=out_file, sep="\n")

    ncomps <- dim(indices)[2]
    for (col in 1:ncomps) {    
        if (col %% print_every == 0){
            cat("\r", paste(col, "of", ncomps, "comparisons completed"))


This will get round the memory problem, by slicing out just the two sequences you want to compare, doing the comparison, writing the results and moving on. But it's still R, so it's still going to relatively be slow. 5 mins to run a thousand sequences on my crappy old desktop

> big_data <- woodmouse[sample(1:dim(woodmouse)[1], 1000, replace=TRUE), ]
> dim(big_data)
[1] 1000  965
> system.time(write_pw(big_data, "test.csv"))
   user  system elapsed 
 316.61    0.56  330.69
Entering edit mode
11.0 years ago
Lee Katz ★ 3.1k

You basically have a memory issue, and the pairwise distances are kind of a red herring in this question. How can you keep calculating these values without slowing down and running out of memory? A database. I think that your main easy choices are BerkeleyDB and Sqlite3. BerkeleyDB has a kind of interesting, easy to use Perl interface, so I'd suggest that.


# warning: I have not personally tried this code
use BerkeleyDB;
my $berkeleydb=tie %db, "BerkeleyDB::Hash", -Filename=>"file.db3",-Flags=>DB_CREATE;
my $distance=dnadistance($seq1,$seq2);
$db{$seq1Id}{$seq2Id}=$distance; # automagically puts this into a database

Login before adding your answer.

Traffic: 1640 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6