Question: Code Golf: Mean Length Of Fasta Sequences
20
gravatar for Eric Normandeau
3.8 years ago by
Eric Normandeau7.1k
Quebec, Canada
Eric Normandeau7.1k wrote:

Hi,

I was about to re-invent the wheel, again, when I thought that there are probably many among you who had already solved this problem.

I have a few huge fasta files (half a million sequences) and I would like to know the average length of the sequences for each of these files, one at a time.

So, let's call this a code golf. The goal is to make a very short code to accomplish the following:

Given a fasta file, return the average length of the sequences.

The correct answer will go to the answer with the most votes on friday around 16h in Quebec (Eastern Time).

You can use anything that can be run on a linux terminal (your favorite language, emboss, awk...), diversity will be appreciated :)

Cheers

CONCLUSION:

And the correct answer goes to the most voted question, as promised :) Thank you all for your great participation! I think I am not the only one to have appreciated the great diversity in the answers, as well as very interesting discussions and ideas.

EDIT:

Although the question is now closed, do not hesitate to vote for all the answers you found interesting, especially those at the bottom! They are not necessarily less interesting. They often have only arrived later :)

Thanks again!

ADD COMMENTlink modified 3.7 years ago by Will4.0k • written 3.8 years ago by Eric Normandeau7.1k
4

A series of "code golf" tournaments might be a fun way to seed the proposed "Project Mendel"

ADD REPLYlink written 3.8 years ago by Mitch Skinner580
2

I love the answers to this question! Amazing diversity: we get the perl, python, R scripts then .... whoa ... flex, clojure, erlang, haskell, ocaml, memory mapped C files.

ADD REPLYlink written 3.8 years ago by Istvan Albert ♦♦ 39k
2

This is one of the best posts in the forum! Opened my eyes to the full universe of programming approaches. Can I +1 code golf as well as this post?

ADD REPLYlink written 3.8 years ago by Wjeck430
1

Create a 'flat' fasta for the human genome that for each chromosome contains the entire sequence as a single line. Now run the tools below and see which one can still do it.

ADD REPLYlink written 3.8 years ago by Istvan Albert ♦♦ 39k
1

I would be incredibly impressed to see an answer in BF (http://en.wikipedia.org/wiki/Brainfuck), Var'aq (http://esoteric.voxelperfect.net/wiki/Var%27aq) or Lolcode (http://en.wikipedia.org/wiki/LOLCODE) ... not for usability but just to see a "real-word" application.

ADD REPLYlink written 3.8 years ago by Will4.0k
1

Kind of what I had in mind :) Something like doing 1 golf-code per week and slowly building up a set of fun problems. Maybe there should be and off-biostars group discussing problem propositions and formulation. For example forming a google group... What do you think?

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
22
gravatar for Cassj
3.8 years ago by
Cassj1.1k
London
Cassj1.1k wrote:

Bioconductor?

library(Biostrings)
s <- read.DNAStringSet(filename,"fasta")
sum(width(s)) / length(s)

@Stefano: Yeah, I think the above does just read everything in. Seems to be able to cope with pretty big files, but actually maybe this is better:

library(Biostrings)
s<-fasta.info(filename)
sum(s)/length(s)
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Cassj1.1k
4

Fair enough, although I would argue that's it's the opposite of hacky to use an existing, well supported, open source library. Sure, it's useful to see how to implement solutions to a problem in different languages and in different ways, but it's also fairly handy to know if someone has already done all the work for you somewhere.

ADD REPLYlink written 3.8 years ago by Cassj1.1k
3

Am I wrong or this code would read the whole file in memory? with Huge fasta file it might be problem

ADD REPLYlink written 3.8 years ago by Stefano Berri3.2k
3

The benefit of using high level languages like R is specifically that they have libraries to help make these kind of tasks easy. Why force folks to re-implement a Fasta parser when a robust solution exists? The interesting thing about this post is the wide variety of approaches; you would take that away by putting restrictions on folks creativity.

ADD REPLYlink written 3.8 years ago by Brad Chapman8.1k
2

I'd point out that the original question uses the phrase "reinvent the wheel" and specifies any command-line solution, including emboss. Libraries do not "do everything". You need to know where to find them, how to install them and how to write code around them. These are core bioinformatics skills. Writing code from first principles, on the other hand, is a computer science skill. Personally, I favour quick, elegant and practical solutions to problems over computational theory.

ADD REPLYlink written 3.8 years ago by Neilfws41k
1

There are no rules, but if someone wants to use a library that does everything, then we don't have different approaches or code, we just have libraries and hacks. That's my own perspective.

ADD REPLYlink written 3.8 years ago by Paulo Nuin3.5k
1

Indeed, maybe next time I propose a code golf, I'll ask people to use their favorite language natively, w/o supplementary libraries. Not because it is functionally better, but maybe didactically preferable :) It also shows better how different approaches/paradigms/philosophies of programming give birth to a lot of diversity! Cheers

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
1

@CassJ I personally find nothing wrong with any of the answers posted here. I just have some thoughts that, for further questions, and in the context that these could be used to fuel a maybe-to-be-called Project Mendel, I may be inclined to ask people to use only with native code from their favorite language! Thanks for your answer :)

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
1

One day we won't need to code, we will have only libraries. I dream of that day, when other people can do 100% of the work for me.

ADD REPLYlink written 3.8 years ago by Paulo Nuin3.5k

I always forget about R and bioconductor

ADD REPLYlink written 3.8 years ago by Will4.0k

and littleR could make this command-line-able: http://code.google.com/p/littler/

ADD REPLYlink written 3.8 years ago by Vince360

This could be a winner. Mean = stats, stats = R.

ADD REPLYlink written 3.8 years ago by Neilfws41k

Hmm, I think you're right. It seems to be able to cope with pretty big files though. Maybe this instead...?

library(Biostrings) s<-fasta.info(filename) sum(s)/length(s)

ADD REPLYlink written 3.8 years ago by Cassj1.1k

Not to mention that the "code" itself is just a hack, as it depends on a library. The real code would be to see the actual library function that reads the file and parse the data. I could do this in any language as long as I had a library that would do everything for me.

ADD REPLYlink written 3.8 years ago by Paulo Nuin3.5k

Ah, maybe I missed the point - I thought the OP was just looking for a quick way to solve the problem. Are the rules that you can only use core features of a language? In that case, I guess I'd go with Perl.

ADD REPLYlink written 3.8 years ago by Cassj1.1k

@Eric - Project Mendel is a grand idea :) If it's mainly focusing on bioinformatics algorithms then maybe there's a similar need for a bioinformatics CookBook with JFDI implementations of common tasks in different languages?

ADD REPLYlink written 3.8 years ago by Cassj1.1k

@CassJ This whole discussion about native code vs. use of modules led me to the very same conclusion earlier today. Namely, that 2 different projects were probably needed. I had, not seriously, thought it could be the "Mendel CookBook" :P More seriously, these 2 projects MUST come into existence. They would be a great asset for the developpement of the bioinformatics community among less computer-oriented people among biologists (and maybe as well among less biology-oriented people among informaticians :)

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

This version is almost twice longer than mine perl version http://biostar.stackexchange.com/questions/1759/code-golf-mean-length-of-fasta-sequences/1805#1805 and is even longer than perl version including invocation. LOL, is it still code golf contest?

ADD REPLYlink written 3.7 years ago by Hynek -Pichi- Vychodil150
12
gravatar for Lars Juhl Jensen
3.8 years ago by
Copenhagen, Denmark
Lars Juhl Jensen8.6k wrote:

How about the following very simple AWK script?

awk '{/>/&&++a||b+=length()}END{print b/a}' file.fa

43 characters only (not counting file name). AWK is not known for its speed, though.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Lars Juhl Jensen8.6k

Indeed, this is the slowest I tried so far! 18secs for the uniprot_sprot.fasta file. Funny, when I asked the question, I thought this approach, being 'closer to linux', would be the fastest. We learn everyday :)

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
11
gravatar for Will
3.8 years ago by
Will4.0k
Will4.0k wrote:

Since it uses itertools and never loads anything into memory this should scale well for even giant fasta files.

from itertools import groupby, imap

tot = 0
num = 0
with open(filename) as handle:
    for header, group in groupby(handle, lambda x:x.startswith('>')):
        if not header:
            num += 1
            tot += sum(imap(lambda x: len(x.strip()), group))
result = float(tot)/num
ADD COMMENTlink modified 3.7 years ago by Eric Normandeau7.1k • written 3.8 years ago by Will4.0k

2.55 seconds for 400,000 sequences. Not bad! I think there is a mistake with the 'imap' function. It should be 'map' :) (I change it)

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

actually its supposed to be imap but also need another import form itertools ... the nice thing about imap is that it won't load the whole iterator into memory at once, it will process the iterator as requested. This is REALLY useful when dealing with fasta files of contigs from many genomes.

ADD REPLYlink written 3.8 years ago by Will4.0k

I also corrected that the result should be a float so I put 'float(tot)' instead of 'tot' in the result.

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

one of these days I'll remember that Python doesn't auto-convert on division ... that gotcha has bitten me in the ass many times before, thank goodness for unit-testing.

ADD REPLYlink written 3.8 years ago by Will4.0k
11
gravatar for Neilfws
3.8 years ago by
Neilfws41k
Sydney, Australia
Neilfws41k wrote:

And just for completeness, a version using the BioRuby library:

#!/usr/bin/ruby
require "rubygems"
require "bio"
seqLen = []

Bio::FlatFile.open(ARGV[0]) do |ff|
  ff.each do |seq|
    seqLen << seq.length
  end
end

puts (seqLen.inject(0) { |sum, element| sum + element }).to_f / seqLen.size
ADD COMMENTlink written 3.8 years ago by Neilfws41k

you can get rid of the (0) after inject, I think.

ADD REPLYlink written 3.8 years ago by Science_Robot970

You could. It just sets an initial value for sum.

ADD REPLYlink written 3.8 years ago by Neilfws41k

I voted +1 (regretfully) for your anwser Neil ;-)

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

Very gracious, I may return the favour :-)

ADD REPLYlink written 3.8 years ago by Neilfws41k
10
gravatar for Stefano Berri
3.8 years ago by
Stefano Berri3.2k
Cambridge
Stefano Berri3.2k wrote:

This seem to do it. straight from the command line. Obviously, being perl, it is compact but a bit obfuscate (also in the reasoning ;))

perl -e 'while(<>){if (/^>/){$c++}else{$a+=length($_)-1}} print $a/$c' myfile.fa

It basically counts all nucleotides (in $a) and the number of sequences (in $c). At the end divides the number of nucleotides by the number of sequences, getting the average length.

$a+=(length($_)-1)

-1 because there is the newline to take into account. use -2 if you use a MS window file. Or chomp.

Does not require lot of memory, and it only need to read the file once.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Stefano Berri3.2k
2

perl -le 'while(<>){/^>/?$c++:{$a+=length}} print $a/$c' myfile.fa

ADD REPLYlink written 3.8 years ago by Cassj1.1k
1

perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa

ADD REPLYlink written 3.8 years ago by The Original Gtk170

I see it uses a similar logic as the one posted by Will. Didn't see it, but he posted first

ADD REPLYlink written 3.8 years ago by Stefano Berri3.2k

I actually pilfered the idea from a blog post: http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/ ... I use it for all of my production code since it deals well with giant (or functionally infinite) fasta files.

ADD REPLYlink written 3.8 years ago by Will4.0k

Same thing but a few chars shorter...

perl -le 'while(<>){/^>/?$c++:{$a+=length}} print $a/$c' myfile.fa

ADD REPLYlink written 3.8 years ago by Cassj1.1k

and I thought perl code properly indented was tough to read ... that might just as well be in Wingdings to me ;)

ADD REPLYlink written 3.8 years ago by Will4.0k

[?]perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa[?]

I personally like the implicit while loop and use of and END block in command line versions such as this.

ADD REPLYlink written 3.8 years ago by The Original Gtk170

perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa

I personally like the implicit while loop and use of and END block in command line hacks such as this.

ADD REPLYlink written 3.8 years ago by The Original Gtk170

I personally like the use of the -n option plus an END block for this kind of command line version.

ADD REPLYlink written 3.8 years ago by The Original Gtk170
10
gravatar for Yang
3.8 years ago by
Yang190
Yang190 wrote:

The perl script might be not so short but quite simple:

my($n, $length) = (0, 0);
while(<>) {
       chomp;
       if(/^>/){
           $n++;
       }else {
           $length += length $_;
       }
   }
  print $length/$n;
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Yang190
1

I LIKE to see perl written on many lines like that :) My eyes hurt less!

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
1

In fact, code like this one bring me closer to the day where I will want to learn Perl! Thanks!

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
10
gravatar for Mitch Skinner
3.8 years ago by
Mitch Skinner580
Berkeley, CA
Mitch Skinner580 wrote:

Haskell Version

I'm a relative Haskell newbie, and I'm sure this could be written more compactly or more elegantly. It's fast, though; even faster than the FLEX version, in my hands. I did see some timing strangeness that I haven't figured out, though--I thought the flex version ran faster the first time I tried it, but now it's consistently slower when I try to run it. Also, I seem to get a different mean value with the perl version than with the others I've tried. So please do reproduce.

EDIT: As Pierre Lindenbaum points out, I forgot the "-f" flag to flex; the times below are updated with that flag.

[?]


Some comparisons, on my machine:

[?]

Thanks for the nice clean problem to learn with!

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Mitch Skinner580
1

Interesting. I had a look at haskell before. By the way, I compiled my flex version with "-f": the lexer is larger but faster.

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

B.interact takes stdin and reads it into a string. And it does that lazily, so this code doesn't ever have the entire file in memory. B.lines separates that string into a list of lines (eliminating newlines); foldl' fastaCount (0, 0) folds the "fastaCount" function over the list of lines, accumulating a count of sequences and a sum of their lengths. showMean takes the resulting count and length and returns a string with the mean, and B.pack does a bit of type conversion (turns the regular string from "show" into the bytestring that B.interact expects)

ADD REPLYlink written 3.8 years ago by Mitch Skinner580

Ah, the "-f" is what I forgot. With -f, the flex version takes 0.986 seconds real time on my machine (about 0.24 seconds slower than the haskell version)

ADD REPLYlink written 3.8 years ago by Mitch Skinner580

import Bio.Sequence

main = do ls <- map seqlength fmap readFasta "input.fasta" print (sum ss div length ss)

ADD REPLYlink written 3.5 years ago by Ketil3.3k

You could also use the Haskell bioinformatics library, which provides Fasta parsing and, among other things, a seqlength function.

ADD REPLYlink written 3.5 years ago by Ketil3.3k
10
gravatar for Keith James
3.8 years ago by
Keith James5.4k
UK
Keith James5.4k wrote:

Common Lisp version using cl-genomic (shameless self-promotion)

(asdf:load-system :cl-genomic)
(in-package :bio-sequence-user)

(time (with-seq-input (seqi "uniprot_sprot.fasta" :fasta :alphabet :aa
                      :parser (make-instance 'virtual-sequence-parser))
  (loop
     for seq = (next seqi)
     while seq
     count seq into n
     sum (length-of seq) into m
     finally (return (/ m n)))))

Evaluation took:
  9.301 seconds of real time
  9.181603 seconds of total run time (8.967636 user, 0.213967 system)
  [ Run times consist of 0.791 seconds GC time, and 8.391 seconds non-GC time. ]
  98.72% CPU
  22,267,952,541 processor cycles
  2,128,118,720 bytes consed

60943088/172805

Reading the other comments, use of libraries seems controversial, so in plain Common Lisp

(defun mean-seq-len (file)
  (declare (optimize (speed 3) (safety 0)))
  (with-open-file (stream file :element-type 'base-char
                          :external-format :ascii)
    (flet ((headerp (line)
             (declare (type string line))
             (find #\> line))
           (line ()
             (read-line stream nil nil)))
      (do ((count 0)
           (total 0)
           (length 0)
           (line (line) (line)))
          ((null line) (/ (+ total length) count))
        (declare (type fixnum count total length))
        (cond ((headerp line)
               (incf count)
               (incf total length)
               (setf length 0))
              (t
               (incf length (length line))))))))

(time (mean-seq-len "uniprot_sprot.fasta"))

Evaluation took:
  3.979 seconds of real time
  3.959398 seconds of total run time (3.803422 user, 0.155976 system)
  [ Run times consist of 0.237 seconds GC time, and 3.723 seconds non-GC time. ]
  99.50% CPU
  9,525,101,733 processor cycles
  1,160,047,616 bytes consed

60943088/172805
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Keith James5.4k
1

Thanks, I was only dreaming of seeing a Common Lisp answer to this question :)

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
9
gravatar for Hynek -Pichi- Vychodil
3.8 years ago by
Brno, Czech Republic
Hynek -Pichi- Vychodil150 wrote:

34 chars, 45 including invocation.

$ time perl -nle'/^>/?$c++:($b+=length)}{print$b/$c' uniprot_sprot.fasta
352.669702844246

real    0m2.169s
user    0m2.048s
sys     0m0.080s
ADD COMMENTlink written 3.8 years ago by Hynek -Pichi- Vychodil150
8
gravatar for Simon Cockell
3.8 years ago by
Simon Cockell6.6k
Newcastle
Simon Cockell6.6k wrote:

Challenge accepted!

5 lines of Python (only 3 of actual work) that print the mean, given the fasta file as a command line argument. It is a bit obfuscated, but I went for brevity:

import sys
from Bio import SeqIO
handle = open(sys.argv[1], 'rU')
lengths = map(lambda seq: len(seq.seq), SeqIO.parse(handle, 'fasta'))
print sum(lengths)/float(len(lengths))

Disclaimer: I have no idea how this will scale. Probably not great.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Simon Cockell6.6k
1

Simon, you can replace the reduce(lambda x,y: x+y, lengths) bit with just sum(lengths) to make the last line more straightforward.

ADD REPLYlink written 3.8 years ago by Brad Chapman8.1k

you could save a line by putting the open() in the SeqIO.parse() call, but my soul rebelled at that point.

ADD REPLYlink written 3.8 years ago by Simon Cockell6.6k

Thanks :) I understand. I tend to write longer code these days in order to be clearer. After all, I'm going to have to re-read this code sometime!

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

It took 6.6 seconds for 400,000 sequences. Not uber fast, but usable :)

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

Brad, of course, I always forget about sum()... Changed for the increased readability.

ADD REPLYlink written 3.8 years ago by Simon Cockell6.6k

If it depends on an external library, I can do in one line.

ADD REPLYlink written 3.8 years ago by Paulo Nuin3.5k

Since the question was about not re-inventing the wheel, I decided not to do it with the fasta parsing. But since that's where the biggest performance hit is, I guess I could have done better implementing it myself.

ADD REPLYlink written 3.8 years ago by Simon Cockell6.6k
8
gravatar for Neilfws
3.8 years ago by
Neilfws41k
Sydney, Australia
Neilfws41k wrote:

Here is another R example, this time using the excellent seqinR package:

library(seqinr)
fasta <- read.fasta("myfile.fa")
mean(sapply(fasta, length))

And here's a way to run it from the command line - save this code as meanLen.R:

library(seqinr)
fasta <- read.fasta(commandArgs(trailingOnly = T))
mean(sapply(fasta, length))

And run it using:

Rscript meanLen.R path/to/fasta/file
ADD COMMENTlink written 3.8 years ago by Neilfws41k
2

R is open source. Feel free to examine the library code yourself :-)

ADD REPLYlink written 3.8 years ago by Neilfws41k
1

Another R hack, show me the function in the library and I will upvote it.

ADD REPLYlink written 3.8 years ago by Paulo Nuin3.5k
8
gravatar for Lars Juhl Jensen
3.8 years ago by
Copenhagen, Denmark
Lars Juhl Jensen8.6k wrote:

Having made the shortest and slowest program, allow me to also present a longer but much faster solution in C that makes use of UNIX memory mapping to read the input file:

#include "sys/types.h"
#include "sys/stat.h"
#include "sys/mman.h"
#include "fcntl.h"
#include "stdio.h"
#include "unistd.h"

void main(int ARGC, char *ARGV[]) {

  int fd;
  struct stat fs;
  char *p1, *p2, *p3;
  unsigned long c1, c2;

  // Map file to memory.
  fd = open(ARGV[1], O_RDWR);
  fstat(fd, &fs);
  p1 = mmap(NULL, fs.st_size, PROT_READ, MAP_SHARED, fd, 0);
  p2 = p1;
  p3 = p1+fs.st_size;

  // Do the actual counting.
  c1 = 0;
  c2 = 0;
  while (p2 < p3 && *p2 != '>') ++p2;
  while (*p2 == '>') {
    ++c1;
    while (p2 < p3 && *p2 != '\n') ++p2;
    while (p2 < p3 && *p2 != '>') {
      if (*p2 >= 'A' && *p2 <= 'Z') ++c2;
      ++p2;
    }
  }

  // Print result.
  printf("File contains %d sequences with average length %.0f\n", c1, (float)c2/c1);

  // Unmap and close file.
  munmap(p1, fs.st_size);
  close(fd);

}

On my server it takes only 0.43s wall time (0.37s CPU time) to process uniprot_sprot.fasta

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Lars Juhl Jensen8.6k

maybe you can put the code to compile this program?

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

gcc -O3 -o avglength avglength.c

ADD REPLYlink written 3.8 years ago by Lars Juhl Jensen8.6k

It's C, but it's not ANSI :-)

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

please enlighten me ... which part of it is not ANSI? (ok, appart from the // that should be / / if I remember right.)

ADD REPLYlink written 3.8 years ago by Lars Juhl Jensen8.6k

please enlighten me - which part is not ANSI-C? (ok, I know that the // comments should be / /)

ADD REPLYlink written 3.8 years ago by Lars Juhl Jensen8.6k

"unistd.h" , open() , mmap(), "sys/*.h", "fcntl.h" are not defined in the ANSI-C spec.

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

but I'm just being picky :-)

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k
7
gravatar for Pierre Lindenbaum
3.8 years ago by
France
Pierre Lindenbaum58k wrote:

something like a copy+paste from a previous question: http://biostar.stackexchange.com/questions/1718 . It won't be the shortest code, but it may be the fastest.

The following code is a FLEX lexer, with the following rules

  • increment the number of sequences for each sequence line ( /^>.*n/ )
  • increment the total size for each symbol of the sequence
  • ignore the spaces
  • raise an error for the other cases:

    %{
    #include <stdio.h>
    #include <stdlib.h>
    static long sequences=0;
    static long size=0L;
    %}
    %option noyywrap
    
    %%
    ^>.*\n   {
         sequences++;
         }
    
    ^[A-Za-z]+  {size+=yyleng;}
    
    [ \t\n] ;
    
    .   {
        fprintf(stderr,"Illegal character  \"%s\".", yytext);
        exit(EXIT_FAILURE);
        }
    
    %%
    
    int main(int argc,char** argv)
        {
        yylex();
        if(sequences==0) return EXIT_FAILURE;
        printf("%f\n",(double)size/(double)sequences);
        return EXIT_SUCCESS;
        }
    

Compilation:

flex -f golf.l
gcc -O3 -Wall lex.yy.c

Test:

wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
gunzip uniprot_sprot.fasta.gz ##232Mo

time ./a.out < uniprot_sprot.fasta

352.669703

real    0m0.692s
user    0m0.572s
sys 0m0.120s
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Pierre Lindenbaum58k
1

definately faster:

time python mean_length.py uniprot_sprot.fasta

352.669702844

11.04s user 0.20s system 100% cpu 11.233 total

ADD REPLYlink written 3.8 years ago by Simon Cockell6.6k

Thanks for testing Simon

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k
7
gravatar for Science_Robot
3.8 years ago by
Science_Robot970
Gainesville, FL
Science_Robot970 wrote:

In Ruby!

a,b=0,[];File.new(ARGV[1]).each do |i|;if i[0]!=62;a+=i.length-1;else;b.push a;a=0;end;end;puts b.inject{|s,v|s+=v}/b.length.to_f

130 Chars, works w/ wrapped FASTA files.

Uncondensed:

a, b = 0, []
File.new(ARGV[0]).each do |i|
  if i[0]!=62
    a += i.length-1
  else
    b.push a
    a=0
  end
end;
puts b.inject{ |s, v| s+=v }/b.length.to_f

Just now learning Ruby so help me out.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Science_Robot970

You're definitely not late, as the correct answer will be decided in 3 days and all solutions are welcomed! :) The assumption that the fasta files are not wrapped is probably not met most of the time, however. Cheers!

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
7
gravatar for Brad Chapman
3.8 years ago by
Brad Chapman8.1k
Boston, MA
Brad Chapman8.1k wrote:

A Clojure version using BioJava to handle the Fasta parsing:

(import 'org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.java.io])

(defn seq-lengths [seq-iter]
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn lazy-avg [coll]
  "Collect the count and number of values in a collection in one pass.

  1 define the function that increments the sum and counts.
  2 reduce the collection, updating the sum and count, and assign to variables
  3 divide the sum by count, keeping it safe in case of 0 division
  "
  (let [f (fn [[s c] val] [(+ s val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
  (println
    (lazy-avg (apply fasta-to-lengths *command-line-args*))))

and a more specific implementation without using external libraries for FASTA:

(use '[clojure.java.io])
(use '[clojure.contrib.str-utils2 :only (join)])

(defn fasta-lengths [in-file]
  "Generate collection of FASTA record lengths, splitting at '>' delimiters"
  (->> (line-seq (reader in-file))
    (partition-by #(.startsWith ^String % ">"))
    (filter #(not (.startsWith ^String (first %) ">")))
    (map #(join "" %))
    (map #(.length ^String %))))

(defn lazy-avg [coll]
  "Collect the count and number of values in a collection in one pass.

  1 define the function that increments the sum and counts.
  2 reduce the collection, updating the sum and count, and assign to variables
  3 divide the sum by count, keeping it safe in case of 0 division
  "
  (let [f (fn [[s c] val] [(+ s val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
  (println
    (lazy-avg (fasta-lengths (first *command-line-args*)))))

I've been iterating over a couple versions of this to improve performance. This thread on StackOverflow has more details for those interested.

% time cljr run fasta_counter.clj uniprot_sprot.fasta PROTEIN 
60943088/172805
11.84s
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Brad Chapman8.1k
7
gravatar for Tom Walsh
3.8 years ago by
Tom Walsh500
Tom Walsh500 wrote:

OCaml:

[?]

To give you some idea of the performance, the natively compiled version takes about 1.4s (real time) on my laptop vs 0.6 s for the C version posted by Lars Juhl Jensen.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Tom Walsh500
6
gravatar for Fred Fleche
3.8 years ago by
Fred Fleche3.7k
Paris, France
Fred Fleche3.7k wrote:

[?]

$totalLength   = 0;
$nbSequences   = 0;
$averageLength = 0;

$fastaFile = fopen("./database/uniprot_sprot.fasta", "r");
if ($fastaFile) {
    while (!feof($fastaFile)) {
        $line = fgets($fastaFile, 4096); 
        if(preg_match('/^>/', $line)) {
          $nbSequences++;
        }
        else {
          $totalLength += strlen(trim($line));
        }

    }
    fclose($fastaFile);
}

$averageLength = ($totalLength / $nbSequences);

print "Total sequences : " . $nbSequences   . "\n";
print "Total length    : " . $totalLength   . "\n";
print "Average length  : " . $averageLength . "\n";

Output :

[?]

[?]

[?]

ADD COMMENTlink written 3.8 years ago by Fred Fleche3.7k
6
gravatar for Science_Robot
3.8 years ago by
Science_Robot970
Gainesville, FL
Science_Robot970 wrote:

Can I post more than one?

grep/awk:

grep '^[GATC].*' f | awk '{sum+=length($0)}END{print sum/NR}'

where f is the filename ;)

62 chars, btw.

ADD COMMENTlink written 3.8 years ago by Science_Robot970

dangit... this too only works for unwrapped fasta files. shoot me!

ADD REPLYlink written 3.8 years ago by Science_Robot970
6
gravatar for Hynek -Pichi- Vychodil
3.8 years ago by
Brno, Czech Republic
Hynek -Pichi- Vychodil150 wrote:

Erlang special golfing 213 chars version:

-module(g).
-export([s/0]).
s()->open_port({fd,0,1},[in,binary,{line,256}]),r(0,0),halt().
r(C,L)->receive{_,{_,{_,<<$>:8,_/binary>>}}}->r(C+1,L);{_,{_,{_,Line}}}->r(C,L+size(Line));_->io:format("~p~n",[L/C])end.

Readable but reliable version:

-module(g).
-export([s/0]).
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C]);
    X -> io:format("Unexpected: ~p~n",[X]),exit(bad_data)
  end.

Compile:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl

Invocation:

$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Hynek -Pichi- Vychodil150
5
gravatar for Pierre Lindenbaum
3.8 years ago by
France
Pierre Lindenbaum58k wrote:

Another answer. Last time I used FLEX, I now use the GNU BISON parser generator. Here I've created a simple grammar to describe a Fasta File.Of course this solution is slower than Flex :-)

%{

include <stdio.h>

include <ctype.h>

static long sequences=0L; static long total=0L; %} %error-verbose %union { int count; }

%token LT %token<count> SYMBOL %token CRLF %token OTHER

%start file

%{ void yyerror(const char* message) { fprintf(stderr,"Error %s\n",message); }

int yylex() { int c=fgetc(stdin); if(c==-1) return EOF; if(c=='>') return LT; if(c=='\n') return CRLF; if(isalpha(c)) return SYMBOL; return OTHER; } %} %%

file: fasta | file fasta; fasta: header body { ++sequences; }; header: LT noncrlfs crlfs; body: line | body line; line: symbols crlfs; symbols: symbol {++total;}| symbols symbol {++total;}; symbol: SYMBOL; crlfs: crlf | crlfs crlf; crlf: CRLF; noncrlfs:noncrlf | noncrlfs noncrlf; noncrlf: SYMBOL| OTHER | LT; %%

int main(int argc,char *argv) { yyparse(); if(sequences>0L) fprintf(stdout,"%f\n",(total/(1.0sequences))); return 0; }

Compilation:

bison golf.y gcc -Wall -O3 golf.tab.c

Test:

time ./a.out < uniprot_sprot.fasta 352.669703

real0m9.723s user0m9.633s sys0m0.080s

ADD COMMENTlink written 3.8 years ago by Pierre Lindenbaum58k
5
gravatar for David W
3.8 years ago by
David W3.0k
New Zealand
David W3.0k wrote:

And now, with awk:

$awk '$1!~ /^>/ {total += length; count++} END {print total/count}' uniprot_sprot.fasta

Still not smoking fast but down to one line (for certain values of "one" and "line")

EDIT

And,as Pierre points out below, I got this wrong again, as it doesn't account for wrapped lines. This one does work:

$ awk '{if (/^>/) {record ++} else {len += length}} END {print len/record}' uniprot_sprot.fasta

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by David W3.0k
4

It returns the mean length of the fasta lines. We need the mean length of the whole fasta sequences.

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

right, should have worked that out really (i'd tested it on an unwrapped fasta file)

ADD REPLYlink written 3.8 years ago by David W3.0k
5
gravatar for Pierre Lindenbaum
3.8 years ago by
France
Pierre Lindenbaum58k wrote:

Using MYSQL, yes we can:

create temporary table T1(seq varchar(255));
LOAD data local infile "uniprot_sprot.fasta" INTO TABLE T1 (seq);
select @SEQ:=count(*) from T1 where left(seq,1)=">";
select @TOTAL:=sum(length(trim(seq))) from T1 where left(seq,1)!=">";
select @TOTAL/@SEQ;

Result:

me@linux-zfgk:~> time mysql -u root -D test < golf.sql
@SEQ:=count(*)
518415
@TOTAL:=sum(length(trim(seq)))
182829264
@TOTAL/@SEQ
352.669702844000000000000000000000

real    0m44.224s
user    0m0.008s
sys     0m0.008s
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Pierre Lindenbaum58k
4
gravatar for David W
3.8 years ago by
David W3.0k
New Zealand
David W3.0k wrote:

Here's another python one, it's really the same as Simon's but using list expressions instead of lamdas (I find them more readable, you might not)

import sys
from Bio import SeqIO
lengths = [len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta")]
print sum(lengths)/float(len(lengths))

For SeqIO to accept filenames as strings (instead of handles) you need biopython 1.54.

And I'm sure this will be slower than any of the other options already presented ;)

EDIT:

As Eric points out below this pumps all the lengths into one big list, which isn't exactly memory extensive ;)

I can hang on to my completely irrational fear of lamda/map and itertools with a generator expression:

lengths = (len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta"))
total= 0
for index, length in enumerate(lengths):
    total += length
print total/float(index + 1)

Still pretty slow, about a minute on computer, so probably not a goer for the original challenge!

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by David W3.0k

@david. Slow, I think, is not the problem here. For very large fasta files, your code is going to create a list of many thousand hundreds (or millions) of integers, which is going to be memory intensive. The best thing to do I guess would be to test it with the data that Pierre used for his test, the uniprot_sprot.fasta file :) Cheers

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k

Oh, yeah, Read the specficication ...

Just added a generator expression version, interestingly enough the uniprot_sprot file doesn't make too much of a footprint with ~500 000 integers. But both methods are really too slow for what you want to do ;(

ADD REPLYlink written 3.8 years ago by David W3.0k

I guess I should have tested the memory footprint before putting the comment in the first place. A list of 50,000,000 integers will take about 1 Gb or so (just saturated the memory of my old faithful laptop trying).

ADD REPLYlink written 3.8 years ago by Eric Normandeau7.1k
4
gravatar for Pierre Lindenbaum
3.8 years ago by
France
Pierre Lindenbaum58k wrote:

A third variation, this time using Javacc (the Java Compiler Compiler):

    options { STATIC=false;}

    PARSER_BEGIN(Golf)
    public class Golf
        {
        private int sequences=0;
        private int total=0;
        public static void main(String args[]) throws Exception
            {
            new Golf(new java.io.BufferedInputStreamSystem.in)).input();
            }
        }
    PARSER_END(Golf)

    TOKEN: /*RESERVED TOKENS FOR UQL */
    {
          <GT: "="">">
       |  <CRLF: ("\n")+="">
       |  <SYMBOLS: (["A"-"Z",="" "a"-"z"])+="">
       |  <OTHER: (~["A"-"Z",="" "a"-"z","\n","="">"]) >
    }

    void input():{} { (fasta() )+ {System.out.println(total/(double)sequences);} }
    void fasta():{} { header() (line())*}
    void header():{} {  <GT> (<SYMBOLS>|<OTHER>|<GT>)* <CRLF> {sequences++;}}
    void line():{Token t;} {  t=<SYMBOLS> <CRLF> { total+=t.image.length(); }}

Compile

            javacc Golf.javacc
            javac Golf.java

Test

            time java Golf < uniprot_sprot.fasta 
            352.6697028442464

            real    0m11.014s
            user    0m11.529s
            sys 0m0.180s
ADD COMMENTlink written 3.8 years ago by Pierre Lindenbaum58k
4
gravatar for Tom Walsh
3.8 years ago by
Tom Walsh500
Tom Walsh500 wrote:

Perl 6. Painfully slow, so I'm probably doing something dumb that isn't obvious to me.

[?]

ADD COMMENTlink written 3.8 years ago by Tom Walsh500

one suggestion: in the case of multi-line FASTA files, the more common scenario is that any given line will NOT be a header line. thus, one speedup would be to test for "ne '>'" first in the while loop. not sure what speeds you're seeing but you could also try to "slurp" the file into memory first.

ADD REPLYlink written 3.8 years ago by Aaronquinlan7.3k

OK I'll give it a go. It's still much slower than Perl5 doing the same thing, i.e. reading line by line.

ADD REPLYlink written 3.8 years ago by Tom Walsh500

Maybe a grammar would be the best option? It is Perl 6...

ADD REPLYlink written 2.2 years ago by Chris Fields460

I might give a grammar a go but I don't expect it'll help much. Even basic I/O - reading a file line by line = is still an order of magnitude slower than in Perl 5.

ADD REPLYlink written 2.2 years ago by Tom Walsh500
3
gravatar for Pierre Lindenbaum
3.8 years ago by
France
Pierre Lindenbaum58k wrote:

another [?]fastest[?] solution in 'C', with a lot of micro-optimization

#include <stdio.h>
#define BUFFER_SIZE 10000000
int main(int argc,char** argv)
    {
    size_t i;
    char buffer[BUFFER_SIZE];
    long total=0L;
    long sequences=0L;
    size_t nRead=0;
    size_t inseq=1;
    while((nRead=fread(buffer,sizeof(char),BUFFER_SIZE,stdin))>0)
        {
        for(i=0;i<nRead;++i)
            {
            switch(buffer[i])
                {
                case '>': if(inseq==1) ++sequences; inseq=0;break;
                case '\n': inseq=1;break;
                case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                case 'G': case 'H': case 'I': case 'J': case 'K': case 'L':
                case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
                case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
                case 'Y': case 'Z':
                case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
                case 'g': case 'h': case 'i': case 'j': case 'k': case 'l':
                case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
                case 's': case 't': case 'u': case 'v': case 'w': case 'x':
                case 'y': case 'z': if(inseq==1) ++total;break;
                default: break;
                }
            }
        }
    printf("%f\n",(total/(1.0*sequences)));
    return 0;
    }

Compilation:

gcc -O3 -Wall golf.c

Result:

time ./a.out < uniprot_sprot.fasta 
352.669703

real    0m0.952s
user    0m0.808s
sys 0m0.140s
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Pierre Lindenbaum58k
1

Pierre, not to blow my own horn, but on my machine your C implementation is more than 2x slower than the one I posted ;)

ADD REPLYlink written 3.8 years ago by Lars Juhl Jensen8.6k
1

but my solution is ANSI :-))

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

oh nooooooo :-)

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

Touché! I guess the price of standard compliance is 2-3x loss in speed? Sounds about right to me ;-)

ADD REPLYlink written 3.8 years ago by Lars Juhl Jensen8.6k

Touché! I guess the price to pay for standards compliance is 2-3x loss in execution speed? Sounds about right to me ;-)

ADD REPLYlink written 3.8 years ago by Lars Juhl Jensen8.6k

ab-so-lu-te-ly ! :-)

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 619 users visited in the last hour