Mean Length Of Fasta Sequences
28
23
Entering edit mode
10.9 years ago

Hi,

I was about to re-invent the wheel, again, when I thought that there are probably many among you who had already solved this problem.

I have a few huge fasta files (half a million sequences) and I would like to know the average length of the sequences for each of these files, one at a time.

So, let's call this a code golf. The goal is to make a very short code to accomplish the following:

Given a fasta file, return the average length of the sequences.

The correct answer will go to the answer with the most votes on friday around 16h in Quebec (Eastern Time).

You can use anything that can be run on a linux terminal (your favorite language, emboss, awk...), diversity will be appreciated :)

Cheers

CONCLUSION:

And the correct answer goes to the most voted question, as promised :) Thank you all for your great participation! I think I am not the only one to have appreciated the great diversity in the answers, as well as very interesting discussions and ideas.

EDIT:

Although the question is now closed, do not hesitate to vote for all the answers you found interesting, especially those at the bottom! They are not necessarily less interesting. They often have only arrived later :)

Thanks again!

code fasta sequence codegolf • 17k views
5
Entering edit mode

A series of "code golf" tournaments might be a fun way to seed the proposed "Project Mendel"

2
Entering edit mode

I love the answers to this question! Amazing diversity: we get the perl, python, R scripts then .... whoa ... flex, clojure, erlang, haskell, ocaml, memory mapped C files.

2
Entering edit mode

This is one of the best posts in the forum! Opened my eyes to the full universe of programming approaches. Can I +1 code golf as well as this post?

1
Entering edit mode

Create a 'flat' fasta for the human genome that for each chromosome contains the entire sequence as a single line. Now run the tools below and see which one can still do it.

1
Entering edit mode

I would be incredibly impressed to see an answer in BF, Var'aq or Lolcode ... not for usability but just to see a "real-word" application.

1
Entering edit mode

Kind of what I had in mind :) Something like doing 1 golf-code per week and slowly building up a set of fun problems. Maybe there should be and off-biostars group discussing problem propositions and formulation. For example forming a google group... What do you think?

24
Entering edit mode
10.9 years ago
Cassj ★ 1.3k

Bioconductor?

library(Biostrings)
sum(width(s)) / length(s)


@Stefano: Yeah, I think the above does just read everything in. Seems to be able to cope with pretty big files, but actually maybe this is better:

library(Biostrings)
s <- fasta.info(filename)
sum(s)/length(s)

4
Entering edit mode

Fair enough, although I would argue that's it's the opposite of hacky to use an existing, well supported, open source library. Sure, it's useful to see how to implement solutions to a problem in different languages and in different ways, but it's also fairly handy to know if someone has already done all the work for you somewhere.

3
Entering edit mode

Am I wrong or this code would read the whole file in memory? with Huge fasta file it might be problem

3
Entering edit mode

The benefit of using high level languages like R is specifically that they have libraries to help make these kind of tasks easy. Why force folks to re-implement a Fasta parser when a robust solution exists? The interesting thing about this post is the wide variety of approaches; you would take that away by putting restrictions on folks creativity.

3
Entering edit mode

I'd point out that the original question uses the phrase "reinvent the wheel" and specifies any command-line solution, including emboss. Libraries do not "do everything". You need to know where to find them, how to install them and how to write code around them. These are core bioinformatics skills. Writing code from first principles, on the other hand, is a computer science skill. Personally, I favour quick, elegant and practical solutions to problems over computational theory.

1
Entering edit mode

There are no rules, but if someone wants to use a library that does everything, then we don't have different approaches or code, we just have libraries and hacks. That's my own perspective.

1
Entering edit mode

Indeed, maybe next time I propose a code golf, I'll ask people to use their favorite language natively, w/o supplementary libraries. Not because it is functionally better, but maybe didactically preferable :) It also shows better how different approaches/paradigms/philosophies of programming give birth to a lot of diversity! Cheers

1
Entering edit mode

@CassJ I personally find nothing wrong with any of the answers posted here. I just have some thoughts that, for further questions, and in the context that these could be used to fuel a maybe-to-be-called Project Mendel, I may be inclined to ask people to use only with native code from their favorite language! Thanks for your answer :)

1
Entering edit mode

One day we won't need to code, we will have only libraries. I dream of that day, when other people can do 100% of the work for me.

0
Entering edit mode

I always forget about R and bioconductor

0
Entering edit mode

and littleR could make this command-line-able

0
Entering edit mode

Newish R can do it natively with the Rscript utility.

0
Entering edit mode

This could be a winner. Mean = stats, stats = R.

0
Entering edit mode

Hmm, I think you're right. It seems to be able to cope with pretty big files though. Maybe this instead...?

library(Biostrings)
s<-fasta.info(filename)
sum(s)/length(s)

0
Entering edit mode

Not to mention that the "code" itself is just a hack, as it depends on a library. The real code would be to see the actual library function that reads the file and parse the data. I could do this in any language as long as I had a library that would do everything for me.

0
Entering edit mode

Ah, maybe I missed the point - I thought the OP was just looking for a quick way to solve the problem. Are the rules that you can only use core features of a language? In that case, I guess I'd go with Perl.

0
Entering edit mode

@Eric - Project Mendel is a grand idea :) If it's mainly focusing on bioinformatics algorithms then maybe there's a similar need for a bioinformatics CookBook with JFDI implementations of common tasks in different languages?

0
Entering edit mode

@CassJ This whole discussion about native code vs. use of modules led me to the very same conclusion earlier today. Namely, that 2 different projects were probably needed. I had, not seriously, thought it could be the "Mendel CookBook" :P More seriously, these 2 projects MUST come into existence. They would be a great asset for the developpement of the bioinformatics community among less computer-oriented people among biologists (and maybe as well among less biology-oriented people among informaticians :)

0
Entering edit mode

This version is almost twice longer than mine perl version and is even longer than perl version including invocation. LOL, is it still code golf contest?

14
Entering edit mode
10.9 years ago

How about the following very simple AWK script?

awk '{/>/&&++a||b+=length()}END{print b/a}' file.fa


43 characters only (not counting file name). AWK is not known for its speed, though.

0
Entering edit mode

Indeed, this is the slowest I tried so far! 18secs for the uniprot_sprot.fasta file. Funny, when I asked the question, I thought this approach, being 'closer to linux', would be the fastest. We learn everyday :)

12
Entering edit mode
10.9 years ago
Neilfws 49k

And just for completeness, a version using the BioRuby library:

#!/usr/bin/ruby
require "rubygems"
require "bio"
seqLen = []

Bio::FlatFile.open(ARGV[0]) do |ff|
ff.each do |seq|
seqLen << seq.length
end
end

puts (seqLen.inject(0) { |sum, element| sum + element }).to_f / seqLen.size

0
Entering edit mode

You can get rid of the (0) after inject, I think.

0
Entering edit mode

You could. It just sets an initial value for sum.

0
Entering edit mode

I voted +1 (regretfully) for your anwser Neil ;-)

0
Entering edit mode

Very gracious, I may return the favour :-)

12
Entering edit mode
10.9 years ago

Common Lisp version using cl-genomic (shameless self-promotion)

(asdf:load-system :cl-genomic)
(in-package :bio-sequence-user)

(time (with-seq-input (seqi "uniprot_sprot.fasta" :fasta :alphabet :aa
:parser (make-instance 'virtual-sequence-parser))
(loop
for seq = (next seqi)
while seq
count seq into n
sum (length-of seq) into m
finally (return (/ m n)))))

Evaluation took:
9.301 seconds of real time
9.181603 seconds of total run time (8.967636 user, 0.213967 system)
[ Run times consist of 0.791 seconds GC time, and 8.391 seconds non-GC time. ]
98.72% CPU
22,267,952,541 processor cycles
2,128,118,720 bytes consed

60943088/172805


Reading the other comments, use of libraries seems controversial, so in plain Common Lisp

(defun mean-seq-len (file)
(declare (optimize (speed 3) (safety 0)))
(with-open-file (stream file :element-type 'base-char
:external-format :ascii)
(declare (type string line))
(find #\> line))
(line ()
(do ((count 0)
(total 0)
(length 0)
(line (line) (line)))
((null line) (/ (+ total length) count))
(declare (type fixnum count total length))
(incf count)
(incf total length)
(setf length 0))
(t
(incf length (length line))))))))

(time (mean-seq-len "uniprot_sprot.fasta"))

Evaluation took:
3.979 seconds of real time
3.959398 seconds of total run time (3.803422 user, 0.155976 system)
[ Run times consist of 0.237 seconds GC time, and 3.723 seconds non-GC time. ]
99.50% CPU
9,525,101,733 processor cycles
1,160,047,616 bytes consed

60943088/172805

1
Entering edit mode

Thanks, I was only dreaming of seeing a Common Lisp answer to this question :)

11
Entering edit mode
10.9 years ago
Will 4.5k

Since it uses itertools and never loads anything into memory this should scale well for even giant fasta files.

from itertools import groupby, imap

tot = 0
num = 0
with open(filename) as handle:
for header, group in groupby(handle, lambda x:x.startswith('>')):
num += 1
tot += sum(imap(lambda x: len(x.strip()), group))
result = float(tot)/num

0
Entering edit mode

2.55 seconds for 400,000 sequences. Not bad! I think there is a mistake with the imap function. It should be map :) (I change it)

0
Entering edit mode

Actually its supposed to be imap but also need another import form itertools ... the nice thing about imap is that it won't load the whole iterator into memory at once, it will process the iterator as requested. This is REALLY useful when dealing with fasta files of contigs from many genomes.

0
Entering edit mode

I also corrected that the result should be a float so I put float(tot) instead of tot in the result.

0
Entering edit mode

one of these days I'll remember that Python doesn't auto-convert on division ... that gotcha has bitten me in the ass many times before, thank goodness for unit-testing.

11
Entering edit mode
10.9 years ago
Neilfws 49k

Here is another R example, this time using the excellent seqinR package:

library(seqinr)
mean(sapply(fasta, length))


And here's a way to run it from the command line - save this code as meanLen.R:

library(seqinr)
mean(sapply(fasta, length))


And run it using:

Rscript meanLen.R path/to/fasta/file

3
Entering edit mode

R is open source. Feel free to examine the library code yourself :-)

1
Entering edit mode

Another R hack, show me the function in the library and I will upvote it.

0
Entering edit mode

Here is an alternative using ape package, which I use way more often:

library(ape)
fasta <- read.dna(commandArgs(trailingOnly = T), "fasta")
mean(sapply(fasta, length))


Run it the same way as your script ;)

10
Entering edit mode
10.9 years ago

This seem to do it. straight from the command line.

Obviously, being perl, it is compact but a bit obfuscate (also in the reasoning ;))

perl -e 'while(<>){if (/^>/){$c++}else{$a+=length($_)-1}} print$a/$c' myfile.fa  It basically counts all nucleotides (in $a) and the number of sequences (in $c). At the end divides the number of nucleotides by the number of sequences, getting the average length. $a+=(length($_)-1)  -1 because there is the newline to take into account. use -2 if you use a MS window file. Or chomp. Does not require lot of memory, and it only need to read the file once. ADD COMMENT 2 Entering edit mode perl -le 'while(<>){/^>/?$c++:{$a+=length}} print$a/$c' myfile.fa  ADD REPLY 1 Entering edit mode perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa  ADD REPLY 0 Entering edit mode I see it uses a similar logic as the one posted by Will. Didn't see it, but he posted first ADD REPLY 0 Entering edit mode I actually pilfered the idea from a blog post ... I use it for all of my production code since it deals well with giant (or functionally infinite) fasta files. ADD REPLY 0 Entering edit mode Same thing but a few chars shorter... perl -le 'while(<>){/^>/?$c++:{$a+=length}} print$a/$c' myfile.fa  ADD REPLY 0 Entering edit mode and I thought perl code properly indented was tough to read ... that might just as well be in Wingdings to me ;) ADD REPLY 0 Entering edit mode perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa  I personally like the implicit while loop and use of and END block in command line versions such as this. ADD REPLY 0 Entering edit mode perl -nle'if(/^>/){$n++}else{$l+=length}END{print$l/$n}' myfile.fa  I personally like the implicit while loop and use of and END block in command line hacks such as this. ADD REPLY 0 Entering edit mode I personally like the use of the -n option plus an END block for this kind of command line version. ADD REPLY 10 Entering edit mode 10.9 years ago Yang ▴ 190 The perl script might be not so short but quite simple: my($n, $length) = (0, 0); while(<>) { chomp; if(/^>/){$n++;
}else {
$length += length$_;
}
}
print $length/$n;

1
Entering edit mode

I LIKE to see perl written on many lines like that :) My eyes hurt less!

1
Entering edit mode

In fact, code like this one bring me closer to the day where I will want to learn Perl! Thanks!

0
Entering edit mode

Hi! I used this perl script and it solved the issue of estimating the average length. However I also need to estimate the standard deviation of the sequence lengths in the fasta file. Does anybody knows how to add this estimation to the aforementioned script? Thank you in advance! :)

10
Entering edit mode
10.9 years ago
Mitch Skinner ▴ 660

I'm a relative Haskell newbie, and I'm sure this could be written more compactly or more elegantly. It's fast, though; even faster than the FLEX version, in my hands. I did see some timing strangeness that I haven't figured out, though--I thought the flex version ran faster the first time I tried it, but now it's consistently slower when I try to run it. Also, I seem to get a different mean value with the perl version than with the others I've tried. So please do reproduce.

EDIT: As Pierre Lindenbaum points out, I forgot the "-f" flag to flex; the times below are updated with that flag.

[?]

Some comparisons, on my machine:

[?]

Thanks for the nice clean problem to learn with!

1
Entering edit mode

Interesting. I had a look at haskell before. By the way, I compiled my flex version with "-f": the lexer is larger but faster.

1
Entering edit mode

You could also use the Haskell bioinformatics library, which provides Fasta parsing and, among other things, a seqlength function.

0
Entering edit mode

B.interact takes stdin and reads it into a string. And it does that lazily, so this code doesn't ever have the entire file in memory. B.lines separates that string into a list of lines (eliminating newlines); foldl' fastaCount (0, 0) folds the "fastaCount" function over the list of lines, accumulating a count of sequences and a sum of their lengths. showMean takes the resulting count and length and returns a string with the mean, and B.pack does a bit of type conversion (turns the regular string from "show" into the bytestring that B.interact expects)

0
Entering edit mode

Ah, the "-f" is what I forgot. With -f, the flex version takes 0.986 seconds real time on my machine (about 0.24 seconds slower than the haskell version)

0
Entering edit mode
import Bio.Sequence
main = do
ls <- map seqlength fmap readFasta "input.fasta"
print (sum ss div length ss)

9
Entering edit mode
10.9 years ago

34 chars, 45 including invocation.

time perl -nle'/^>/?$c++:($b+=length)}{print$b/$c' uniprot_sprot.fasta
352.669702844246

real    0m2.169s
user    0m2.048s
sys     0m0.080s

8
Entering edit mode
10.9 years ago

Challenge accepted!

5 lines of Python (only 3 of actual work) that print the mean, given the fasta file as a command line argument. It is a bit obfuscated, but I went for brevity:

import sys
from Bio import SeqIO
handle = open(sys.argv[1], 'rU')
lengths = map(lambda seq: len(seq.seq), SeqIO.parse(handle, 'fasta'))
print sum(lengths)/float(len(lengths))


Disclaimer: I have no idea how this will scale. Probably not great.

1
Entering edit mode

Simon, you can replace the reduce(lambda x,y: x+y, lengths) bit with just sum(lengths) to make the last line more straightforward.

0
Entering edit mode

You could save a line by putting the open() in the SeqIO.parse() call, but my soul rebelled at that point.

0
Entering edit mode

Thanks :) I understand. I tend to write longer code these days in order to be clearer. After all, I'm going to have to re-read this code sometime!

0
Entering edit mode

It took 6.6 seconds for 400,000 sequences. Not uber fast, but usable :)

0
Entering edit mode

Brad, of course, I always forget about sum()... Changed for the increased readability.

0
Entering edit mode

If it depends on an external library, I can do in one line.

0
Entering edit mode

Since the question was about not re-inventing the wheel, I decided not to do it with the fasta parsing. But since that's where the biggest performance hit is, I guess I could have done better implementing it myself.

8
Entering edit mode
10.9 years ago

Having made the shortest and slowest program, allow me to also present a longer but much faster solution in C that makes use of UNIX memory mapping to read the input file:

#include "sys/types.h"
#include "sys/stat.h"
#include "sys/mman.h"
#include "fcntl.h"
#include "stdio.h"
#include "unistd.h"

void main(int ARGC, char *ARGV[]) {

int fd;
struct stat fs;
char *p1, *p2, *p3;
unsigned long c1, c2;

// Map file to memory.
fd = open(ARGV[1], O_RDWR);
fstat(fd, &fs);
p1 = mmap(NULL, fs.st_size, PROT_READ, MAP_SHARED, fd, 0);
p2 = p1;
p3 = p1+fs.st_size;

// Do the actual counting.
c1 = 0;
c2 = 0;
while (p2 < p3 && *p2 != '>') ++p2;
while (*p2 == '>') {
++c1;
while (p2 < p3 && *p2 != '\n') ++p2;
while (p2 < p3 && *p2 != '>') {
if (*p2 >= 'A' && *p2 <= 'Z') ++c2;
++p2;
}
}

// Print result.
printf("File contains %d sequences with average length %.0f\n", c1, (float)c2/c1);

// Unmap and close file.
munmap(p1, fs.st_size);
close(fd);

}


On my server it takes only 0.43s wall time (0.37s CPU time) to process uniprot_sprot.fasta

0
Entering edit mode

maybe you can put the code to compile this program?

0
Entering edit mode
gcc -O3 -o avglength avglength.c

0
Entering edit mode

It's C, but it's not ANSI :-)

0
Entering edit mode

please enlighten me ... which part of it is not ANSI? (ok, apart from the // that should be /* */ if I remember right.)

0
Entering edit mode

please enlighten me - which part is not ANSI-C? (ok, I know that the // comments should be /* */)

0
Entering edit mode

unistd.h , open() , mmap(), sys/*.h, fcntl.h are not defined in the ANSI-C spec.

0
Entering edit mode

but I'm just being picky :-)

7
Entering edit mode
10.9 years ago

something like a copy+paste from a previous question. It won't be the shortest code, but it may be the fastest. The following code is a FLEX lexer, with the following rules

• increment the number of sequences for each sequence line ( /^>.*n/ )
• increment the total size for each symbol of the sequence
• ignore the spaces
• raise an error for the other cases:

%{
#include <stdio.h>
#include <stdlib.h>
static long sequences=0;
static long size=0L;
%}
%option noyywrap

%%
^>.*\n   {
sequences++;
}

^[A-Za-z]+  {size+=yyleng;}

[ \t\n] ;

.   {
fprintf(stderr,"Illegal character  \"%s\".", yytext);
exit(EXIT_FAILURE);
}

%%

int main(int argc,char** argv)
{
yylex();
if(sequences==0) return EXIT_FAILURE;
printf("%f\n",(double)size/(double)sequences);
return EXIT_SUCCESS;
}


Compilation:

flex -f golf.l
gcc -O3 -Wall lex.yy.c


Test:

wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
gunzip uniprot_sprot.fasta.gz ##232Mo

time ./a.out < uniprot_sprot.fasta

352.669703

real    0m0.692s
user    0m0.572s
sys 0m0.120s

1
Entering edit mode

Definitely faster:

time python mean_length.py uniprot_sprot.fasta

352.669702844
11.04s user 0.20s system 100% cpu 11.233 total

0
Entering edit mode

Thanks for testing Simon

7
Entering edit mode
10.9 years ago
Science_Robot ★ 1.1k

In Ruby!

a,b=0,[];File.new(ARGV[1]).each do |i|;if i[0]!=62;a+=i.length-1;else;b.push a;a=0;end;end;puts b.inject{|s,v|s+=v}/b.length.to_f


130 Chars, works w/ wrapped FASTA files.

Uncondensed:

a, b = 0, []
File.new(ARGV[0]).each do |i|
if i[0]!=62
a += i.length-1
else
b.push a
a=0
end
end;
puts b.inject{ |s, v| s+=v }/b.length.to_f


Just now learning Ruby so help me out.

0
Entering edit mode

You're definitely not late, as the correct answer will be decided in 3 days and all solutions are welcomed! :) The assumption that the fasta files are not wrapped is probably not met most of the time, however. Cheers!

7
Entering edit mode
10.9 years ago

A Clojure version using BioJava to handle the Fasta parsing:

(import 'org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.java.io])

(defn seq-lengths [seq-iter]
"Produce a lazy collection of sequence lengths given a BioJava StreamReader"
(lazy-seq
(if (.hasNext seq-iter)
(cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
"Use BioJava to read a Fasta input file as a StreamReader of sequences"
(seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn lazy-avg [coll]
"Collect the count and number of values in a collection in one pass.

1 define the function that increments the sum and counts.
2 reduce the collection, updating the sum and count, and assign to variables
3 divide the sum by count, keeping it safe in case of 0 division
"
(let [f (fn [[s c] val] [(+ s val) (inc c)])
[sum cnt] (reduce f [0 0] coll)]
(if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
(println
(lazy-avg (apply fasta-to-lengths *command-line-args*))))


and a more specific implementation without using external libraries for FASTA:

(use '[clojure.java.io])
(use '[clojure.contrib.str-utils2 :only (join)])

(defn fasta-lengths [in-file]
"Generate collection of FASTA record lengths, splitting at '>' delimiters"
(partition-by #(.startsWith ^String % ">"))
(filter #(not (.startsWith ^String (first %) ">")))
(map #(join "" %))
(map #(.length ^String %))))

(defn lazy-avg [coll]
"Collect the count and number of values in a collection in one pass.

1 define the function that increments the sum and counts.
2 reduce the collection, updating the sum and count, and assign to variables
3 divide the sum by count, keeping it safe in case of 0 division
"
(let [f (fn [[s c] val] [(+ s val) (inc c)])
[sum cnt] (reduce f [0 0] coll)]
(if (zero? cnt) 0 (/ sum cnt))))

(when *command-line-args*
(println
(lazy-avg (fasta-lengths (first *command-line-args*)))))


I've been iterating over a couple versions of this to improve performance. This thread on StackOverflow has more details for those interested.

% time cljr run fasta_counter.clj uniprot_sprot.fasta PROTEIN
60943088/172805
11.84s

7
Entering edit mode
10.9 years ago

Erlang special golfing 213 chars version:

-module(g).
-export([s/0]).
s()->open_port({fd,0,1},[in,binary,{line,256}]),r(0,0),halt().
r(C,L)->receive{_,{_,{_,<<$>:8,_/binary>>}}}->r(C+1,L);{_,{_,{_,Line}}}->r(C,L+size(Line));_->io:format("~p~n",[L/C])end.  Readable but reliable version: -module(g). -export([s/0]). s()-> P = open_port({fd, 0, 1}, [in, binary, {line, 256}]), r(P, 0, 0), halt(). r(P, C, L) -> receive {P, {data, {eol, <<$>:8, _/binary>>}}} ->
r(P, C+1, L);
{P, {data, {eol, Line}}} ->
r(P, C, L + size(Line));
{'EXIT', P, normal} ->
io:format("~p~n",[L/C]);
end.


Compile:

$erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl  Invocation: $ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s

7
Entering edit mode
10.9 years ago
Tom Walsh ▴ 550

OCaml:

[?]

To give you some idea of the performance, the natively compiled version takes about 1.4s (real time) on my laptop vs 0.6 s for the C version posted by Lars Juhl Jensen.

6
Entering edit mode
10.9 years ago

[?]

$totalLength = 0;$nbSequences   = 0;
$averageLength = 0;$fastaFile = fopen("./database/uniprot_sprot.fasta", "r");
if ($fastaFile) { while (!feof($fastaFile)) {
$line = fgets($fastaFile, 4096);
if(preg_match('/^>/', $line)) {$nbSequences++;
}
else {
$totalLength += strlen(trim($line));
}

}
fclose($fastaFile); }$averageLength = ($totalLength /$nbSequences);

print "Total sequences : " . $nbSequences . "\n"; print "Total length : " .$totalLength   . "\n";
print "Average length  : " . $averageLength . "\n";  Output : [?] [?] [?] ADD COMMENT 6 Entering edit mode 10.9 years ago Science_Robot ★ 1.1k Can I post more than one? grep/awk: grep '^[GATC].*' f | awk '{sum+=length($0)}END{print sum/NR}'


where f is the filename ;)

62 chars, btw.

0
Entering edit mode

dangit... this too only works for unwrapped fasta files. shoot me!

5
Entering edit mode
10.9 years ago

Another answer. Last time I used FLEX, I now use the GNU BISON parser generator. Here I've created a simple grammar to describe a Fasta File.Of course this solution is slower than Flex :-)

%{

#include <stdio.h>
#include <ctype.h>

static long sequences=0L;
static long total=0L;
%}
%error-verbose
%union {
int count;
}

%token LT
%token<count> SYMBOL
%token CRLF
%token OTHER

%start file

%{
void yyerror(const char* message)
{
fprintf(stderr,"Error %s\n",message);
}

int yylex()
{
int c=fgetc(stdin);
if(c==-1) return EOF;
if(c=='>') return LT;
if(c=='\n') return CRLF;
if(isalpha(c)) return SYMBOL;
return OTHER;
}
%}
%%

file:  fasta | file fasta;
fasta: header body { ++sequences; };
body: line |  body line;
line: symbols crlfs;
symbols: symbol {++total;}| symbols symbol {++total;};
symbol: SYMBOL;
crlfs: crlf | crlfs crlf;
crlf: CRLF;
noncrlfs:noncrlf | noncrlfs noncrlf;
noncrlf: SYMBOL| OTHER | LT;
%%

int main(int argc,char **argv)
{
yyparse();
if(sequences>0L) fprintf(stdout,"%f\n",(total/(1.0*sequences)));
return 0;
}


Compilation:

bison golf.y
gcc -Wall -O3 golf.tab.c


Test:

time ./a.out  < uniprot_sprot.fasta
352.669703

real0m9.723s
user0m9.633s
sys0m0.080s

5
Entering edit mode
10.9 years ago
David W 4.8k

And now, with awk:

$awk '$1!~ /^>/ {total += length; count++} END {print total/count}' uniprot_sprot.fasta


Still not smoking fast but down to one line (for certain values of "one" and "line")

EDIT

And,as Pierre points out below, I got this wrong again, as it doesn't account for wrapped lines. This one does work:

$awk '{if (/^>/) {record ++} else {len += length}} END {print len/record}' uniprot_sprot.fasta  ADD COMMENT 4 Entering edit mode It returns the mean length of the fasta lines. We need the mean length of the whole fasta sequences. ADD REPLY 0 Entering edit mode right, should have worked that out really (i'd tested it on an unwrapped fasta file) ADD REPLY 5 Entering edit mode 10.9 years ago Using MYSQL, yes we can: create temporary table T1(seq varchar(255)); LOAD data local infile "uniprot_sprot.fasta" INTO TABLE T1 (seq); select @SEQ:=count(*) from T1 where left(seq,1)=">"; select @TOTAL:=sum(length(trim(seq))) from T1 where left(seq,1)!=">"; select @TOTAL/@SEQ;  Result: me@linux-zfgk:~> time mysql -u root -D test < golf.sql @SEQ:=count(*) 518415 @TOTAL:=sum(length(trim(seq))) 182829264 @TOTAL/@SEQ 352.669702844000000000000000000000 real 0m44.224s user 0m0.008s sys 0m0.008s  ADD COMMENT 4 Entering edit mode 10.9 years ago David W 4.8k Here's another python one, it's really the same as Simon's but using list expressions instead of lamdas (I find them more readable, you might not) import sys from Bio import SeqIO lengths = [len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta")] print sum(lengths)/float(len(lengths))  For SeqIO to accept filenames as strings (instead of handles) you need biopython 1.54. And I'm sure this will be slower than any of the other options already presented ;) EDIT: As Eric points out below this pumps all the lengths into one big list, which isn't exactly memory extensive ;) I can hang on to my completely irrational fear of lamda/map and itertools with a generator expression: lengths = (len(seq) for seq in SeqIO.parse(sys.argv[1], "fasta")) total= 0 for index, length in enumerate(lengths): total += length print total/float(index + 1)  Still pretty slow, about a minute on computer, so probably not a goer for the original challenge! ADD COMMENT 0 Entering edit mode @david. Slow, I think, is not the problem here. For very large fasta files, your code is going to create a list of many thousand hundreds (or millions) of integers, which is going to be memory intensive. The best thing to do I guess would be to test it with the data that Pierre used for his test, the uniprot_sprot.fasta file :) Cheers ADD REPLY 0 Entering edit mode Oh, yeah, Read the specficication ... Just added a generator expression version, interestingly enough the uniprot_sprot file doesn't make too much of a footprint with ~500 000 integers. But both methods are really too slow for what you want to do ;( ADD REPLY 0 Entering edit mode I guess I should have tested the memory footprint before putting the comment in the first place. A list of 50,000,000 integers will take about 1 Gb or so (just saturated the memory of my old faithful laptop trying). ADD REPLY 4 Entering edit mode 10.9 years ago A third variation, this time using Javacc (the Java Compiler Compiler): options { STATIC=false;} PARSER_BEGIN(Golf) public class Golf { private int sequences=0; private int total=0; public static void main(String args[]) throws Exception { new Golf(new java.io.BufferedInputStreamSystem.in)).input(); } } PARSER_END(Golf) TOKEN: /*RESERVED TOKENS FOR UQL */ { <GT: "="">"> | <CRLF: ("\n")+=""> | <SYMBOLS: (["A"-"Z",="" "a"-"z"])+=""> | <OTHER: (~["A"-"Z",="" "a"-"z","\n","="">"]) > } void input():{} { (fasta() )+ {System.out.println(total/(double)sequences);} } void fasta():{} { header() (line())*} void header():{} { <GT> (<SYMBOLS>|<OTHER>|<GT>)* <CRLF> {sequences++;}} void line():{Token t;} { t=<SYMBOLS> <CRLF> { total+=t.image.length(); }}  Compile javacc Golf.javacc javac Golf.java  Test time java Golf < uniprot_sprot.fasta 352.6697028442464 real 0m11.014s user 0m11.529s sys 0m0.180s  ADD COMMENT 4 Entering edit mode 10.9 years ago Tom Walsh ▴ 550 Perl 6. Painfully slow, so I'm probably doing something dumb that isn't obvious to me. [?] ADD COMMENT 0 Entering edit mode one suggestion: in the case of multi-line FASTA files, the more common scenario is that any given line will NOT be a header line. thus, one speedup would be to test for "ne '>'" first in the while loop. not sure what speeds you're seeing but you could also try to "slurp" the file into memory first. ADD REPLY 0 Entering edit mode OK I'll give it a go. It's still much slower than Perl5 doing the same thing, i.e. reading line by line. ADD REPLY 0 Entering edit mode Maybe a grammar would be the best option? It is Perl 6... ADD REPLY 0 Entering edit mode I might give a grammar a go but I don't expect it'll help much. Even basic I/O - reading a file line by line = is still an order of magnitude slower than in Perl 5. ADD REPLY 3 Entering edit mode 10.9 years ago another [?]fastest[?] solution in 'C', with a lot of micro-optimization #include <stdio.h> #define BUFFER_SIZE 10000000 int main(int argc,char** argv) { size_t i; char buffer[BUFFER_SIZE]; long total=0L; long sequences=0L; size_t nRead=0; size_t inseq=1; while((nRead=fread(buffer,sizeof(char),BUFFER_SIZE,stdin))>0) { for(i=0;i<nRead;++i) { switch(buffer[i]) { case '>': if(inseq==1) ++sequences; inseq=0;break; case '\n': inseq=1;break; case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z': case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n': case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z': if(inseq==1) ++total;break; default: break; } } } printf("%f\n",(total/(1.0*sequences))); return 0; }  Compilation: gcc -O3 -Wall golf.c  Result: time ./a.out < uniprot_sprot.fasta 352.669703 real 0m0.952s user 0m0.808s sys 0m0.140s  ADD COMMENT 1 Entering edit mode Pierre, not to blow my own horn, but on my machine your C implementation is more than 2x slower than the one I posted ;) ADD REPLY 1 Entering edit mode but my solution is ANSI :-)) ADD REPLY 0 Entering edit mode oh nooooooo :-) ADD REPLY 0 Entering edit mode Touché! I guess the price of standard compliance is 2-3x loss in speed? Sounds about right to me ;-) ADD REPLY 0 Entering edit mode Touché! I guess the price to pay for standards compliance is 2-3x loss in execution speed? Sounds about right to me ;-) ADD REPLY 0 Entering edit mode ab-so-lu-te-ly ! :-) ADD REPLY 1 Entering edit mode 2.7 years ago this is a nim solution import strutils var line: string seqCnt = 0 charCnt = 0 proc main()= for line in stdin.lines: if line.startsWith('>'): inc seqCnt else: charCnt += len(line) echo (charCnt/seqCnt) main()  compilation nim compile golf.nim  result time ./golf < uniprot_sprot.fasta 358.9318342665173 real 0m2.293s user 0m0.892s sys 0m0.147s  (uniprot is a big larger now so the answer is slightly different) ADD COMMENT 1 Entering edit mode 9 weeks ago cicindel ▴ 70 Using the seqmagick utility (an imagemagick-inspired interface to biopython): seqmagick info *fasta  To get just the average lengths: seqmagick info *fasta | awk '{OFS="\t"; print$5, \$1}'