Question: Code golf: mean length of fasta sequences

Hi,

I was about to re-invent the wheel, again, when I thought that there are probably many among you who had already solved this problem.

I have a few huge fasta files (half a million sequences) and I would like to know the average length of the sequences for each of these files, one at a time.

So, let's call this a code golf. The goal is to make a very short code to accomplish the following:

Given a fasta file, return the average length of the sequences.

The correct answer will be the one with the most votes on Friday around 16:00 in Quebec (Eastern Time).

You can use anything that can be run on a linux terminal (your favorite language, emboss, awk...), diversity will be appreciated :)

Cheers

CONCLUSION:

And the correct answer goes to the most voted answer, as promised :) Thank you all for your great participation! I think I am not the only one to have appreciated the great diversity of the answers, as well as the very interesting discussions and ideas.

EDIT:

Although the question is now closed, do not hesitate to vote for all the answers you found interesting, especially those at the bottom! They are not necessarily less interesting; they often just arrived later :)

Thanks again!

Create a 'flat' fasta for the human genome that for each chromosome contains the entire sequence as a single line. Now run the tools below and see which one can still do it.
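For testing, such a flat file can be produced with a short unwrapping pass. This is an illustrative Python sketch (not one of the thread's answers), shown because wrapped vs. unwrapped input trips up several of the solutions below:

```python
# Sketch: unwrap a FASTA so each record's sequence becomes a single line.
# Illustrative helper, not from the thread.
def flatten_fasta(lines):
    out, seq = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq:                      # flush the previous record's sequence
                out.append("".join(seq))
                seq = []
            out.append(line)             # keep the header as-is
        else:
            seq.append(line)
    if seq:                              # flush the final record
        out.append("".join(seq))
    return out

print(flatten_fasta([">chr1", "ACGT", "ACG", ">chr2", "TT"]))
# -> ['>chr1', 'ACGTACG', '>chr2', 'TT']
```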

I would be incredibly impressed to see an answer in BF (http://en.wikipedia.org/wiki/Brainfuck), Var'aq (http://esoteric.voxelperfect.net/wiki/Var%27aq) or Lolcode (http://en.wikipedia.org/wiki/LOLCODE) ... not for usability but just to see a "real-world" application.

A series of "code golf" tournaments might be a fun way to seed the proposed "Project Mendel"

Kind of what I had in mind :) Something like doing one code golf per week and slowly building up a set of fun problems. Maybe there should be an off-biostars group discussing problem propositions and formulation, for example a Google group... What do you think?

I love the answers to this question! Amazing diversity: we get the perl, python, R scripts then .... whoa ... flex, clojure, erlang, haskell, ocaml, memory mapped C files.

This is one of the best posts in the forum! Opened my eyes to the full universe of programming approaches. Can I +1 code golf as well as this post?

Bioconductor?

``````library(Biostrings)
s <- readDNAStringSet(filename)  # the read step was missing from the post; this loads everything into memory
sum(width(s)) / length(s)
``````

@Stefano: Yeah, I think the above does just read everything in. Seems to be able to cope with pretty big files, but actually maybe this is better:

``````library(Biostrings)
s<-fasta.info(filename)
sum(s)/length(s)
``````

I always forget about R and bioconductor

and littler could make this command-line-able: http://code.google.com/p/littler/

This could be a winner. Mean = stats, stats = R.

Am I wrong, or would this code read the whole file into memory? With a huge fasta file that might be a problem.

Hmm, I think you're right. It seems to be able to cope with pretty big files though. Maybe this instead...?

`library(Biostrings); s<-fasta.info(filename); sum(s)/length(s)`

Not to mention that the "code" itself is just a hack, as it depends on a library. The real code would be the actual library function that reads the file and parses the data. I could do this in any language, as long as I had a library that did everything for me.

Ah, maybe I missed the point - I thought the OP was just looking for a quick way to solve the problem. Are the rules that you can only use core features of a language? In that case, I guess I'd go with Perl.

There are no rules, but if someone wants to use a library that does everything, then we don't have different approaches or code, we just have libraries and hacks. That's my own perspective.

Indeed, maybe next time I propose a code golf, I'll ask people to use their favorite language natively, w/o supplementary libraries. Not because it is functionally better, but maybe didactically preferable :) It also shows better how different approaches/paradigms/philosophies of programming give birth to a lot of diversity! Cheers

Fair enough, although I would argue that it's the opposite of hacky to use an existing, well-supported, open source library. Sure, it's useful to see how to implement solutions to a problem in different languages and in different ways, but it's also fairly handy to know if someone has already done all the work for you somewhere.

@CassJ I personally find nothing wrong with any of the answers posted here. I just have some thoughts that, for further questions, and in the context that these could be used to fuel a maybe-to-be-called Project Mendel, I may be inclined to ask people to use only native code from their favorite language! Thanks for your answer :)

The benefit of using high-level languages like R is specifically that they have libraries to make these kinds of tasks easy. Why force folks to re-implement a Fasta parser when a robust solution exists? The interesting thing about this post is the wide variety of approaches; you would take that away by putting restrictions on folks' creativity.

I'd point out that the original question uses the phrase "reinvent the wheel" and specifies any command-line solution, including emboss. Libraries do not "do everything". You need to know where to find them, how to install them and how to write code around them. These are core bioinformatics skills. Writing code from first principles, on the other hand, is a computer science skill. Personally, I favour quick, elegant and practical solutions to problems over computational theory.

@Eric - Project Mendel is a grand idea :) If it's mainly focusing on bioinformatics algorithms then maybe there's a similar need for a bioinformatics CookBook with JFDI implementations of common tasks in different languages?

@CassJ This whole discussion about native code vs. use of modules led me to the very same conclusion earlier today: namely, that 2 different projects were probably needed. I had, not seriously, thought it could be the "Mendel CookBook" :P More seriously, these 2 projects MUST come into existence. They would be a great asset for the development of the bioinformatics community among less computer-oriented biologists (and maybe also among less biology-oriented informaticians :)

One day we won't need to code, we will have only libraries. I dream of that day, when other people can do 100% of the work for me.

This version is almost twice as long as my perl version http://biostar.stackexchange.com/questions/1759/code-golf-mean-length-of-fasta-sequences/1805#1805 and is even longer than the perl version including the invocation. LOL, is this still a code golf contest?

How about the following very simple AWK script?

``````awk '{/>/&&++a||b+=length()}END{print b/a}' file.fa
``````

43 characters only (not counting file name). AWK is not known for its speed, though.

Indeed, this is the slowest I have tried so far! 18 seconds for the uniprot_sprot.fasta file. Funny: when I asked the question, I thought this approach, being 'closer to linux', would be the fastest. We learn every day :)

Since it uses itertools and never loads anything into memory this should scale well for even giant fasta files.

``````from itertools import groupby, imap

tot = 0
num = 0
with open(filename) as handle:
    # groupby yields alternating runs of header lines and sequence lines
    for is_header, group in groupby(handle, lambda x: x.startswith('>')):
        if is_header:
            num += 1   # one record per header group
        else:
            tot += sum(imap(lambda x: len(x.strip()), group))
result = float(tot) / num
``````

2.55 seconds for 400,000 sequences. Not bad! I think there is a mistake with the 'imap' function. It should be 'map' :) (I changed it)

actually it's supposed to be `imap`, but you also need another `import` from `itertools` ... the nice thing about imap is that it won't load the whole iterator into memory at once; it will process the iterator as requested. This is REALLY useful when dealing with fasta files of contigs from many genomes.
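To see why the groupby approach works: consecutive lines collapse into alternating header/sequence groups, and each group can be consumed as a stream. A toy sketch (Python 3, where the built-in `map` is already lazy like Python 2's `imap`):

```python
from itertools import groupby

lines = [">s1", "ACGT", "AC", ">s2", "GGG"]

tot = num = 0
for is_header, group in groupby(lines, lambda x: x.startswith(">")):
    if is_header:
        num += sum(1 for _ in group)               # one per header line
    else:
        tot += sum(len(x.strip()) for x in group)  # residues in this run
print(tot / num)  # 9 residues over 2 sequences -> 4.5
```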

I also corrected that the result should be a float so I put 'float(tot)' instead of 'tot' in the result.

one of these days I'll remember that Python doesn't auto-convert on division ... that gotcha has bitten me in the ass many times before, thank goodness for unit-testing.
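For the record, the gotcha in question: in Python 2, `/` between two ints floors silently, which is why the fix wraps one operand in `float()`. Python 3 splits the two behaviors into separate operators:

```python
# Python 3: "/" is always true division; "//" is floor division.
# Python 2's int / int behaved like "//" below, hence the float(tot) fix.
tot, num = 9, 2
print(tot / num)         # 4.5
print(tot // num)        # 4, the silent truncation that bit the Python 2 code
print(float(tot) / num)  # 4.5, the portable fix used above
```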

And just for completeness, a version using the BioRuby library:

``````#!/usr/bin/ruby
require "rubygems"
require "bio"
seqLen = []

Bio::FlatFile.open(ARGV[0]) do |ff|
  ff.each do |seq|
    seqLen << seq.length
  end
end

puts (seqLen.inject(0) { |sum, element| sum + element }).to_f / seqLen.size
``````

you can get rid of the (0) after inject, I think.

You could. It just sets an initial value for sum.

I voted +1 (regretfully) for your answer Neil ;-)

Very gracious, I may return the favour :-)

This seems to do it, straight from the command line. Obviously, being Perl, it is compact but a bit obfuscated (also in the reasoning ;))

`perl -e 'while(<>){if (/^>/){$c++}else{$a+=length($_)-1}} print $a/$c' myfile.fa`

It basically counts all nucleotides (in $a) and the number of sequences (in $c). At the end it divides the number of nucleotides by the number of sequences, giving the average length.

`$a+=(length($_)-1)`

-1 because there is the newline to take into account. Use -2 if you have an MS Windows file. Or chomp.

Does not require a lot of memory, and it only needs to read the file once.
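The -1 bookkeeping is easy to verify: a line as read still carries its terminator, so raw lengths overcount by one (or two with CRLF endings). A quick check in Python:

```python
# Lines read from a file keep their terminator, so length() overcounts
# by 1 on Unix ("\n") and by 2 on Windows ("\r\n") unless stripped.
assert len("ACGT\n") == 5             # 4 residues + newline  -> the "-1"
assert len("ACGT\r\n") == 6           # 4 residues + CRLF     -> the "-2"
assert len("ACGT\r\n".rstrip()) == 4  # the chomp-style fix
print("newline accounting checks out")
```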

I see it uses similar logic to the one posted by Will. I hadn't seen it; he posted first.

I actually pilfered the idea from a blog post: http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/ ... I use it for all of my production code since it deals well with giant (or functionally infinite) fasta files.

Same thing but a few chars shorter...

`perl -le 'while(<>){/^>/?$c++:{$a+=length}} print $a/$c' myfile.fa`

and I thought perl code properly indented was tough to read ... that might just as well be in Wingdings to me ;)

`perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa`

I personally like the implicit while loop and the use of an END block in command-line versions such as this.


This Perl script may not be so short, but it's quite simple:

``````my ($n, $length) = (0, 0);
while (<>) {
    chomp;
    if (/^>/) {
        $n++;
    } else {
        $length += length $_;
    }
}
print $length/$n;
``````

I LIKE to see perl written on many lines like that :) My eyes hurt less!

In fact, code like this brings me closer to the day when I will want to learn Perl! Thanks!

I'm a relative Haskell newbie, and I'm sure this could be written more compactly or more elegantly. It's fast, though; even faster than the FLEX version, in my hands. I did see some timing strangeness that I haven't figured out, though--I thought the flex version ran faster the first time I tried it, but now it's consistently slower when I try to run it. Also, I seem to get a different mean value with the perl version than with the others I've tried. So please do reproduce.

EDIT: As Pierre Lindenbaum points out, I forgot the "-f" flag to flex; the times below are updated with that flag.

[?]

Some comparisons, on my machine:

[?]

Thanks for the nice clean problem to learn with!

Interesting. I had a look at haskell before. By the way, I compiled my flex version with "-f": the lexer is larger but faster.

B.interact takes stdin and reads it into a string. And it does that lazily, so this code doesn't ever have the entire file in memory. B.lines separates that string into a list of lines (eliminating newlines); foldl' fastaCount (0, 0) folds the "fastaCount" function over the list of lines, accumulating a count of sequences and a sum of their lengths. showMean takes the resulting count and length and returns a string with the mean, and B.pack does a bit of type conversion (turns the regular string from "show" into the bytestring that B.interact expects)
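For readers who don't speak Haskell, the single pass described above (fold a step function over the lines, accumulating a (count, length) pair, then divide) can be sketched in Python, with `functools.reduce` standing in for `foldl'`. This is an illustrative translation, not the author's code:

```python
from functools import reduce

def fasta_count(acc, line):
    # Fold step: headers bump the sequence count, other lines add length.
    count, total = acc
    if line.startswith(">"):
        return (count + 1, total)
    return (count, total + len(line))

lines = [">s1", "ACGT", "AC", ">s2", "GGG"]
count, total = reduce(fasta_count, lines, (0, 0))
print(total / count)  # mean length: 9 / 2 = 4.5
```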

Ah, the "-f" is what I forgot. With -f, the flex version takes 0.986 seconds real time on my machine (about 0.24 seconds slower than the haskell version)

``````import Bio.Sequence

main = do ss <- map seqlength `fmap` readFasta "input.fasta"
          print (sum ss `div` length ss)
``````

You could also use the Haskell bioinformatics library, which provides Fasta parsing and, among other things, a seqlength function.

Common Lisp version using cl-genomic (shameless self-promotion)

``````(asdf:load-system :cl-genomic)
(in-package :bio-sequence-user)

(time (with-seq-input (seqi "uniprot_sprot.fasta" :fasta :alphabet :aa
:parser (make-instance 'virtual-sequence-parser))
(loop
for seq = (next seqi)
while seq
count seq into n
sum (length-of seq) into m
finally (return (/ m n)))))

Evaluation took:
9.301 seconds of real time
9.181603 seconds of total run time (8.967636 user, 0.213967 system)
[ Run times consist of 0.791 seconds GC time, and 8.391 seconds non-GC time. ]
98.72% CPU
22,267,952,541 processor cycles
2,128,118,720 bytes consed

60943088/172805
``````

Reading the other comments, use of libraries seems controversial, so in plain Common Lisp

``````(defun mean-seq-len (file)
  (declare (optimize (speed 3) (safety 0)))
  (with-open-file (stream file :element-type 'base-char
                          :external-format :ascii)
    ;; the local helpers were garbled in the archived post; this flet is a
    ;; reconstruction consistent with the surviving fragments
    (flet ((headerp (line)
             (declare (type string line))
             (find #\> line))
           (line ()
             (read-line stream nil nil)))
      (do ((count 0)
           (total 0)
           (length 0)
           (line (line) (line)))
          ((null line) (/ (+ total length) count))
        (declare (type fixnum count total length))
        (cond ((headerp line)
               (incf count)
               (incf total length)
               (setf length 0))
              (t
               (incf length (length line))))))))

(time (mean-seq-len "uniprot_sprot.fasta"))

Evaluation took:
3.979 seconds of real time
3.959398 seconds of total run time (3.803422 user, 0.155976 system)
[ Run times consist of 0.237 seconds GC time, and 3.723 seconds non-GC time. ]
99.50% CPU
9,525,101,733 processor cycles
1,160,047,616 bytes consed

60943088/172805
``````

Thanks, I was only dreaming of seeing a Common Lisp answer to this question :)

34 chars, 45 including invocation.

$ time perl -nle'/^>/?$c++:($b+=length)}{print$b/$c' uniprot_sprot.fasta
352.669702844246

real    0m2.169s
user    0m2.048s
sys     0m0.080s
``````

Challenge accepted!

5 lines of Python (only 3 of actual work) that print the mean, given the fasta file as a command line argument. It is a bit obfuscated, but I went for brevity:

``````import sys
from Bio import SeqIO
handle = open(sys.argv[1], 'rU')
lengths = map(lambda seq: len(seq.seq), SeqIO.parse(handle, 'fasta'))
print sum(lengths)/float(len(lengths))
``````

Disclaimer: I have no idea how this will scale. Probably not great.

you could save a line by putting the open() in the SeqIO.parse() call, but my soul rebelled at that point.

Thanks :) I understand. I tend to write longer code these days in order to be clearer. After all, I'm going to have to re-read this code sometime!

It took 6.6 seconds for 400,000 sequences. Not uber fast, but usable :)

Simon, you can replace the `reduce(lambda x,y: x+y, lengths)` bit with just `sum(lengths)` to make the last line more straightforward.

If it depends on an external library, I can do it in one line.

Since the question was about not re-inventing the wheel, I decided not to do it with the fasta parsing. But since that's where the biggest performance hit is, I guess I could have done better implementing it myself.

Here is another R example, this time using the excellent seqinR package:

``````library(seqinr)
fasta <- read.fasta("myfile.fasta")  # the read step was missing from the post; read.fasta returns a list of sequences
mean(sapply(fasta, length))
``````

And here's a way to run it from the command line - save this code as meanLen.R:

``````library(seqinr)
fasta <- read.fasta(commandArgs(TRUE)[1])  # take the file path from the command line (reconstructed line)
mean(sapply(fasta, length))
``````

And run it using:

``````Rscript meanLen.R path/to/fasta/file
``````

Another R hack, show me the function in the library and I will upvote it.

R is open source. Feel free to examine the library code yourself :-)

Having made the shortest and slowest program, allow me to also present a longer but much faster solution in C that makes use of UNIX memory mapping to read the input file:

``````#include "sys/types.h"
#include "sys/stat.h"
#include "sys/mman.h"
#include "fcntl.h"
#include "stdio.h"
#include "unistd.h"

int main(int ARGC, char *ARGV[]) {

    int fd;
    struct stat fs;
    char *p1, *p2, *p3;
    unsigned long c1, c2;

    // Map file to memory.
    fd = open(ARGV[1], O_RDWR);
    fstat(fd, &fs);
    p1 = mmap(NULL, fs.st_size, PROT_READ, MAP_SHARED, fd, 0);
    p2 = p1;
    p3 = p1 + fs.st_size;

    // Do the actual counting.
    c1 = 0;
    c2 = 0;
    while (p2 < p3 && *p2 != '>') ++p2;
    while (p2 < p3 && *p2 == '>') {
        ++c1;
        while (p2 < p3 && *p2 != '\n') ++p2;
        while (p2 < p3 && *p2 != '>') {
            if (*p2 >= 'A' && *p2 <= 'Z') ++c2;
            ++p2;
        }
    }

    // Print result.
    printf("File contains %lu sequences with average length %.0f\n", c1, (float)c2/c1);

    // Unmap and close file.
    munmap(p1, fs.st_size);
    close(fd);
    return 0;
}
``````

On my server it takes only 0.43s wall time (0.37s CPU time) to process uniprot_sprot.fasta

maybe you can post the command to compile this program?

gcc -O3 -o avglength avglength.c

It's C, but it's not ANSI :-)

please enlighten me ... which part of it is not ANSI? (OK, apart from the // comments, which should be /* */ if I remember right.)

"unistd.h" , open() , mmap(), "sys/*.h", "fcntl.h" are not defined in the ANSI-C spec.

but I'm just being picky :-)

something like a copy+paste from a previous question: http://biostar.stackexchange.com/questions/1718 . It won't be the shortest code, but it may be the fastest.

The following code is a FLEX lexer, with the following rules

• increment the number of sequences for each header line ( /^>.*\n/ )
• increment the total size for each symbol of the sequence
• ignore the spaces
• raise an error for the other cases:

``````%{
#include <stdio.h>
#include <stdlib.h>
static long sequences=0;
static long size=0L;
%}
%option noyywrap

%%
^>.*\n   {
sequences++;
}

^[A-Za-z]+  {size+=yyleng;}

[ \t\n] ;

.   {
fprintf(stderr,"Illegal character  \"%s\".", yytext);
exit(EXIT_FAILURE);
}

%%

int main(int argc,char** argv)
{
yylex();
if(sequences==0) return EXIT_FAILURE;
printf("%f\n",(double)size/(double)sequences);
return EXIT_SUCCESS;
}
``````

Compilation:

``````flex -f golf.l
gcc -O3 -Wall lex.yy.c
``````

Test:

``````wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
gunzip uniprot_sprot.fasta.gz ## 232 MB

time ./a.out < uniprot_sprot.fasta

352.669703

real    0m0.692s
user    0m0.572s
sys 0m0.120s
``````

definitely faster:

time python mean_length.py uniprot_sprot.fasta

352.669702844

11.04s user 0.20s system 100% cpu 11.233 total

Thanks for testing Simon

In Ruby!

``````a,b=0,[];File.new(ARGV[0]).each do |i|;if i[0]!=62;a+=i.length-1;else;b.push a;a=0;end;end;puts b.inject{|s,v|s+=v}/b.length.to_f
``````

130 Chars, works w/ wrapped FASTA files.

Uncondensed:

``````a, b = 0, []
File.new(ARGV[0]).each do |i|
  if i[0] != ">"        # sequence line (the golfed i[0]!=62 compares bytes, Ruby 1.8 style)
    a += i.length - 1   # -1 drops the trailing newline
  else                  # header line: close out the previous record
    b.push a if a > 0
    a = 0
  end
end
b.push a                # don't lose the final sequence
puts b.inject { |s, v| s += v } / b.length.to_f
``````

Just now learning Ruby so help me out.

You're definitely not late, as the correct answer will be decided in 3 days and all solutions are welcome! :) The assumption that fasta files are not wrapped is probably not met most of the time, however. Cheers!

A Clojure version using BioJava to handle the Fasta parsing:

``````(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.java.io])

(defn seq-lengths [seq-iter]
"Produce a lazy collection of sequence lengths given a BioJava StreamReader"
(lazy-seq
(if (.hasNext seq-iter)
(cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
"Use BioJava to read a Fasta input file as a StreamReader of sequences"
(seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn lazy-avg [coll]
"Collect the count and number of values in a collection in one pass.

1 define the function that increments the sum and counts.
2 reduce the collection, updating the sum and count, and assign to variables
3 divide the sum by count, keeping it safe in case of 0 division
"
(let [f (fn [[s c] val] [(+ s val) (inc c)])
[sum cnt] (reduce f [0 0] coll)]
(if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
(println
(lazy-avg (apply fasta-to-lengths *command-line-args*))))
``````

and a more specific implementation without using external libraries for FASTA:

``````(use '[clojure.java.io])
(use '[clojure.contrib.str-utils2 :only (join)])

(defn fasta-lengths [in-file]
  "Generate collection of FASTA record lengths, splitting at '>' delimiters"
  ;; the ->> pipeline opener over the file's lines was lost in formatting;
  ;; this reconstruction threads line-seq through the steps below
  (->> (line-seq (reader in-file))
       (partition-by #(.startsWith ^String % ">"))
       (filter #(not (.startsWith ^String (first %) ">")))
       (map #(join "" %))
       (map #(.length ^String %))))

(defn lazy-avg [coll]
"Collect the count and number of values in a collection in one pass.

1 define the function that increments the sum and counts.
2 reduce the collection, updating the sum and count, and assign to variables
3 divide the sum by count, keeping it safe in case of 0 division
"
(let [f (fn [[s c] val] [(+ s val) (inc c)])
[sum cnt] (reduce f [0 0] coll)]
(if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
(println
(lazy-avg (fasta-lengths (first *command-line-args*)))))
``````

I've been iterating over a couple versions of this to improve performance. This thread on StackOverflow has more details for those interested.

``````% time cljr run fasta_counter.clj uniprot_sprot.fasta PROTEIN
60943088/172805
11.84s
``````

OCaml:

[?]

To give you some idea of the performance, the natively compiled version takes about 1.4s (real time) on my laptop vs 0.6 s for the C version posted by Lars Juhl Jensen.

[?]

``````$totalLength   = 0;
$nbSequences   = 0;
$averageLength = 0;

$fastaFile = fopen("./database/uniprot_sprot.fasta", "r");
if ($fastaFile) {
    while (!feof($fastaFile)) {
        $line = fgets($fastaFile, 4096);
        if (preg_match('/^>/', $line)) {
            $nbSequences++;
        } else {
            $totalLength += strlen(trim($line));
        }
    }
    fclose($fastaFile);
}

$averageLength = ($totalLength / $nbSequences);

print "Total sequences : " . $nbSequences   . "\n";
print "Total length    : " . $totalLength   . "\n";
print "Average length  : " . $averageLength . "\n";
``````

Output :

[?]

[?]

[?]

Can I post more than one?

grep/awk:

``````grep '^[GATC].*' f | awk '{sum+=length($0)}END{print sum/NR}'
``````

where f is the filename ;)

62 chars, btw.

dangit... this too only works for unwrapped fasta files. shoot me!

Erlang special golfing 213 chars version:

``````-module(g).
-export([s/0]).
s()->open_port({fd,0,1},[in,binary,{line,256}]),r(0,0),halt().
``````

``````-module(g).
-export([s/0]).
s()->
    P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
    r(P, 0, 0),
    halt().
r(P, C, L) ->
    receive  % the receive wrapper was lost in the archived post
        {P, {data, {eol, <<$>:8, _/binary>>}}} ->
            r(P, C+1, L);
        {P, {data, {eol, Line}}} ->
            r(P, C, L + size(Line));
        {'EXIT', P, normal} ->
            io:format("~p~n", [L/C])
    end.
``````

Compile:

``````\$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
``````

Invocation:

``````\$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s
``````

Another answer. Last time I used FLEX; now I use the GNU BISON parser generator. Here I've created a simple grammar to describe a Fasta file. Of course, this solution is slower than Flex :-)

``````%{
#include <stdio.h>
#include <ctype.h>

static long sequences=0L;
static long total=0L;
%}
%error-verbose
%union { int count; }

%token LT
%token<count> SYMBOL
%token CRLF
%token OTHER

%start file

%{
void yyerror(const char* message)
    {
    fprintf(stderr,"Error %s\n",message);
    }

int yylex()
    {
    int c=fgetc(stdin);
    if(c==-1) return EOF;
    if(c=='>') return LT;
    if(c=='\n') return CRLF;
    if(isalpha(c)) return SYMBOL;
    return OTHER;
    }
%}
%%

file: fasta | file fasta;
fasta: header body { ++sequences; };
header: LT noncrlfs crlfs;
body: line | body line;
line: symbols crlfs;
symbols: symbol {++total;} | symbols symbol {++total;};
symbol: SYMBOL;
crlfs: crlf | crlfs crlf;
crlf: CRLF;
noncrlfs: noncrlf | noncrlfs noncrlf;
noncrlf: SYMBOL | OTHER | LT;
%%

int main(int argc, char** argv)
    {
    yyparse();
    if(sequences>0L) fprintf(stdout,"%f\n",(total/(1.0*sequences)));
    return 0;
    }
``````

Compilation:

``````bison golf.y
gcc -Wall -O3 golf.tab.c
``````

Test:

``````time ./a.out < uniprot_sprot.fasta
352.669703

real    0m9.723s
user    0m9.633s
sys     0m0.080s
``````

And now, with awk:

``````$ awk '$1!~ /^>/ {total += length; count++} END {print total/count}' uniprot_sprot.fasta
``````

Still not smoking fast but down to one line (for certain values of "one" and "line")

EDIT

And, as Pierre points out below, I got this wrong again, as it doesn't account for wrapped lines. This one does work:

`$ awk '{if (/^>/) {record ++} else {len += length}} END {print len/record}' uniprot_sprot.fasta`

It returns the mean length of the fasta lines. We need the mean length of the whole fasta sequences.

Right, should have worked that out really (I'd tested it on an unwrapped fasta file)

Using MySQL, yes we can:

``````create temporary table T1(seq varchar(255));
LOAD data local infile "uniprot_sprot.fasta" INTO TABLE T1 (seq);
select @SEQ:=count(*) from T1 where left(seq,1)=">";
select @TOTAL:=sum(length(trim(seq))) from T1 where left(seq,1)!=">";
select @TOTAL/@SEQ;
``````

Result:

``````me@linux-zfgk:~> time mysql -u root -D test < golf.sql
@SEQ:=count(*)
518415
@TOTAL:=sum(length(trim(seq)))
182829264
@TOTAL/@SEQ
352.669702844000000000000000000000

real    0m44.224s
user    0m0.008s
sys     0m0.008s
``````

Here's another Python one. It's really the same as Simon's, but using list comprehensions instead of lambdas (I find them more readable; you might not).

``````import sys
from Bio import SeqIO
lengths = [len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta")]
print sum(lengths)/float(len(lengths))
``````

For SeqIO to accept filenames as strings (instead of handles) you need biopython 1.54.

And I'm sure this will be slower than any of the other options already presented ;)

EDIT:

As Eric points out below, this pumps all the lengths into one big list, which isn't exactly memory efficient ;)

I can hang on to my completely irrational fear of lambda/map and itertools with a generator expression:

``````lengths = (len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta"))
total = 0
for index, length in enumerate(lengths):
    total += length
print total/float(index + 1)
``````

Still pretty slow, about a minute on computer, so probably not a goer for the original challenge!

@david. Slow, I think, is not the problem here. For very large fasta files, your code is going to create a list of many hundreds of thousands (or millions) of integers, which is going to be memory intensive. The best thing to do, I guess, would be to test it with the data Pierre used for his test, the uniprot_sprot.fasta file :) Cheers

Oh, yeah. Read the specification ...

Just added a generator expression version; interestingly enough, the uniprot_sprot file doesn't make too much of a footprint with ~500,000 integers. But both methods are really too slow for what you want to do ;(

I guess I should have tested the memory footprint before putting the comment in the first place. A list of 50,000,000 integers will take about 1 GB or so (I just saturated the memory of my old faithful laptop trying).
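The container overhead alone is easy to measure with `sys.getsizeof` (CPython-specific, and it excludes the integer objects themselves, so real usage is higher still):

```python
import sys

n = 500000
as_list = [0] * n               # one pointer slot per element
as_gen = (0 for _ in range(n))  # fixed-size generator object

# The list's slots alone run to megabytes; the generator stays tiny
# no matter how many values it will yield.
print(sys.getsizeof(as_list), "bytes for the list")
print(sys.getsizeof(as_gen), "bytes for the generator")
assert sys.getsizeof(as_gen) < 1000 < sys.getsizeof(as_list)
```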

A third variation, this time using Javacc (the Java Compiler Compiler):

``````options { STATIC=false; }

PARSER_BEGIN(Golf)
public class Golf
{
private int sequences=0;
private int total=0;
public static void main(String args[]) throws Exception
{
new Golf(new java.io.BufferedInputStream(System.in)).input();
}
}
PARSER_END(Golf)

TOKEN: /*RESERVED TOKENS FOR UQL */
{
<GT: ">">
|  <CRLF:  ("\n")+>
|  <SYMBOLS: (["A"-"Z", "a"-"z"])+>
|  <OTHER: (~["A"-"Z", "a"-"z","\n",">"]) >
}

void input():{} { (fasta() )+ {System.out.println(total/(double)sequences);} }
void header():{} {  <GT> (<SYMBOLS>|<OTHER>|<GT>)* <CRLF> {sequences++;}}
void line():{Token t;} {  t=<SYMBOLS> <CRLF> { total+=t.image.length(); }}
``````

Compile

``````javacc Golf.javacc
javac Golf.java
``````

Test

``````time java Golf < uniprot_sprot.fasta
352.6697028442464

real    0m11.014s
user    0m11.529s
sys 0m0.180s
``````

Perl 6. Painfully slow, so I'm probably doing something dumb that isn't obvious to me.

[?]

One suggestion: in the case of multi-line FASTA files, the more common scenario is that any given line will NOT be a header line. Thus, one speedup would be to test for `ne '>'` first in the while loop. Not sure what speeds you're seeing, but you could also try to "slurp" the file into memory first.

OK, I'll give it a go. It's still much slower than Perl 5 doing the same thing, i.e. reading line by line.

Maybe a grammar would be the best option? It is Perl 6...

I might give a grammar a go, but I don't expect it'll help much. Even basic I/O - reading a file line by line - is still an order of magnitude slower than in Perl 5.

another "fastest" solution in C, with a lot of micro-optimization

``````#include <stdio.h>
#define BUFFER_SIZE 10000000
int main(int argc,char** argv)
    {
    size_t i;
    size_t nRead;
    /* static: a 10MB automatic array would overflow the default stack limit */
    static char buffer[BUFFER_SIZE];
    long total=0L;
    long sequences=0L;
    size_t inseq=1;
    /* the read loop was lost in the archived post; a plain fread
       loop over stdin is a reconstruction of the missing lines */
    while((nRead=fread(buffer,1,BUFFER_SIZE,stdin))>0)
        {
        for(i=0;i<nRead;++i)
            {
            switch(buffer[i])
                {
                case '>': if(inseq==1) ++sequences; inseq=0;break;
                case '\n': inseq=1;break;
                case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                case 'G': case 'H': case 'I': case 'J': case 'K': case 'L':
                case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
                case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
                case 'Y': case 'Z':
                case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
                case 'g': case 'h': case 'i': case 'j': case 'k': case 'l':
                case 'm': case 'n': case 'o': case 'p': case 'q': case 'r':
                case 's': case 't': case 'u': case 'v': case 'w': case 'x':
                case 'y': case 'z': if(inseq==1) ++total;break;
                default: break;
                }
            }
        }
    printf("%f\n",(total/(1.0*sequences)));
    return 0;
    }
``````

Compilation:

``````gcc -O3 -Wall golf.c
``````

Result:

``````time ./a.out < uniprot_sprot.fasta
352.669703

real    0m0.952s
user    0m0.808s
sys 0m0.140s
``````

Pierre, not to blow my own horn, but on my machine your C implementation is more than 2x slower than the one I posted ;)

oh nooooooo :-)

but my solution is ANSI :-))

Touché! I guess the price of standard compliance is a 2-3x loss in speed? Sounds about right to me ;-)