Question

Looking For A Light-Weight Command-Line Tool To Compute Pairwise Sequence Distances

3

Entering edit mode

12.4 years ago

Mike ▴ 80

I've been asked to analyze a very large dataset of DNA barcodes (short 600-ish bp COI gene sequences) to compare INTERspecific vs. INTRAspecific sequence distances. I've had some luck using fdnadist (the EMBOSS version of Phylip's dnadist) and the APE package in R to generate distance matrices, and then parsing the output. Unfortunately, as the number of sequences grows, the generation of distance matrices becomes memory-prohibitive. I'm looking for a way to loop through the set of sequences and compute distance measures on the fly, and store these numbers either in a database or text file, to avoid the memory problem.

Is there a good tool that can generate this distance measure between 2 DNA sequences?

sequence-analysis barcode • 3.1k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 12.4 years ago by Mike ▴ 80

score 1 · Answer 1 · 2013-02-21

Hi Mike,

There will definitely be faster ways to do this, but here's an approach I've done for similar problems (modified to fit your requirements):

write_pw <- function(seqs, fname, subs_model="p", print_every=1000){
    on.exit(cat("\n"))
    indices <- combn(dim(seqs)[1], 2)
    out_file = file(fname, "w")
    writeLines("sp1, sp2, dist", con=out_file, sep="\n")

    per_line <- function(i){
        d <- dist.dna(seqs[i,], model=subs_model)
        writeLines( paste( c(labels(d), d), collapse=", "), con=out_file, sep="\n")
    }

    ncomps <- dim(indices)[2]
    for (col in 1:ncomps) {    
        per_line(indices[,col])
        if (col %% print_every == 0){
            cat("\r", paste(col, "of", ncomps, "comparisons completed"))
        }
    }

    close(out_file)
}

This will get round the memory problem, by slicing out just the two sequences you want to compare, doing the comparison, writing the results and moving on. But it's still R, so it's still going to relatively be slow. 5 mins to run a thousand sequences on my crappy old desktop

>data(woodmouse)
> big_data <- woodmouse[sample(1:dim(woodmouse)[1], 1000, replace=TRUE), ]
> dim(big_data)
[1] 1000  965
> system.time(write_pw(big_data, "test.csv"))
   user  system elapsed 
 316.61    0.56  330.69

score 1 · Answer 2 · 2013-02-21

You basically have a memory issue, and the pairwise distances are kind of a red herring in this question. How can you keep calculating these values without slowing down and running out of memory? A database. I think that your main easy choices are BerkeleyDB and Sqlite3. BerkeleyDB has a kind of interesting, easy to use Perl interface, so I'd suggest that.

http://search.cpan.org/~pmqs/BerkeleyDB-0.51/BerkeleyDB.pod

# warning: I have not personally tried this code
use BerkeleyDB;
my(%db);
my $berkeleydb=tie %db, "BerkeleyDB::Hash", -Filename=>"file.db3",-Flags=>DB_CREATE;
my $distance=dnadistance($seq1,$seq2);
$db{$seq1Id}{$seq2Id}=$distance; # automagically puts this into a database