This question has come via a former colleague:
"I've done my work in bioinformatics in Python, but for data crunching it's really slow (one of the applications is running for almost a week), so I decided to switch to C. Does anybody know about something like Biopython in C (it seems frustrating to rewrite translation, transcription, database entries parsing from scratch)"
A cursory glance at Google shows there are a lot of unmaintained attempts to create an O|B|F-style set of C libraries for computational biology. Rather than focusing on why Python isn't working for this particular case, does anyone know of actively maintained C/C++ libraries as described?
I would like to mention Bio++, which is especially rich in phylogenetics (substitution models, etc.).
Bio++ is a set of C++ libraries for bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics.
Then there is also bppSuite, built on top of that.
I recommend easel (the library behind HMMER; you can get it with any source code release of HMMER3). I've read (though never used) large portions of its source code. The algorithms are documented, the code is lucid, and it is ANSI C AFAIK, but without knowing what you need I cannot tell whether easel covers it.
One of my "selling points" for using Haskell for bioinformatics is that it combines a high level of abstraction with high performance. This makes it very quick to write applications, especially when chores like parsing file formats are already solved in a library, and yet the resulting applications compile to native code and run at speeds comparable to C (typically within a factor of two).
Perhaps Haskell isn't for everybody, but I think there is a clear need for a language that optimizes programmer time (which is, let's face it, a far more scarce resource than CPU time) without sacrificing too much CPU time.
Kevin Thornton's libsequence "is a C++ library designed to aid writing applications for genomics and evolutionary genetics. A large amount of the library is dedicated to the analysis of "single nucleotide polymorphism", or SNP data. The library is intended to be viewed as a "BioC++" akin to the bioperl project, although ... libsequence tries not to re-invent the wheel. Rather, the focus is on biological computation, such as the analysis of SNP data and sequence divergence, and the analysis of data generated from coalescent simulation."
EDIT 16 July 2011:
GeCo++ (Genomic computation C++ library) is "a C++ class library to the purpose of making easier and faster the efficient implementation of algorithms for sequence analysis when functional annotations and genomic variations need to be considered."
It really depends on what you need, which you have not clarified. For writing algorithms, SeqAn is the best, but it is not meant for format parsing. For that purpose, there are no good C/C++ libraries. Java is usually better than scripts when speed is a concern.
EDIT: Maybe I can advertise my own work:
kseq.h: The most efficient and versatile fasta/fastq parser.
ksw: SSE2 Smith-Waterman. Probably only SWPS3 can achieve a similar speed.
knhx: Light-weight New Hampshire parser (not thoroughly tested though)
khmm: Basic HMM library.
These are all single-file or two-file libraries. If you need a component, copying one or two source files is enough, so there are no dependencies to worry about. These are also the most efficient among libraries with similar functionality (e.g. kseq.h is 10X faster than the fastx_toolkit parser; ksw is 30X faster than a non-SSE2 implementation).
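For readers who have not seen it, the kind of dynamic programming that an SSE2 Smith-Waterman implementation speeds up can be sketched in plain scalar C. This is only an illustration and not ksw's code or API: the function name sw_score and its scoring parameters are made up here, and it uses a simple linear gap penalty and returns only the best local score, not the alignment itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Best local alignment score of a versus b (Smith-Waterman, linear gaps).
 * Hypothetical helper for illustration; match > 0, mismatch < 0, gap < 0.
 * Returns -1 if memory cannot be allocated. */
int sw_score(const char *a, const char *b, int match, int mismatch, int gap)
{
    assert(a != NULL && b != NULL);
    size_t n = strlen(a), m = strlen(b);

    /* Only the score is needed, so a single DP row is enough. */
    int *row = calloc(m + 1, sizeof *row);
    if (row == NULL)
        return -1;

    int best = 0;
    for (size_t i = 1; i <= n; ++i) {
        int diag = 0;                      /* H[i-1][j-1]; column 0 is always 0 */
        for (size_t j = 1; j <= m; ++j) {
            int up = row[j];               /* H[i-1][j], not yet overwritten */
            int h = diag + (a[i - 1] == b[j - 1] ? match : mismatch);
            if (up + gap > h)
                h = up + gap;              /* gap in b */
            if (row[j - 1] + gap > h)
                h = row[j - 1] + gap;      /* gap in a; row[j-1] is H[i][j-1] */
            if (h < 0)
                h = 0;                     /* local alignment: never below zero */
            row[j] = h;
            if (h > best)
                best = h;
            diag = up;
        }
    }
    free(row);
    return best;
}
```

With these illustrative parameters, sw_score("ACGT", "ACGT", 2, -1, -2) returns 8 (four matches), and two sequences with no character in common score 0. The inner loop touches every cell of an n by m matrix one at a time, which is the part a vectorized implementation processes several cells at once.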
As to the speed of parsing: in another post, we have discussed that many bio* components are very inefficient for the sake of completeness. Writing your own can be much faster. The key reason we do not write parsers in C is that for typical parsing, C is not much faster than a script. Maybe twice as fast, but who cares?
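To give an idea of how little code "writing your own" can take when you only need part of a format, here is a minimal sketch of a FASTA scanner in C that counts records and residues. It is not kseq.h and does not reproduce its API; the type fasta_stats and the two functions are hypothetical names chosen for this example, and it assumes plain FASTA where a record starts with '>' at the beginning of a line.

```c
#include <assert.h>
#include <stddef.h>

/* Running totals plus the small amount of state needed between chunks. */
typedef struct {
    size_t n_records;    /* number of '>' header lines seen */
    size_t n_residues;   /* sequence characters, excluding whitespace */
    int in_header;       /* currently inside a header line */
    int at_line_start;   /* the next byte begins a new line */
} fasta_stats;

void fasta_stats_init(fasta_stats *st)
{
    assert(st != NULL);
    st->n_records = 0;
    st->n_residues = 0;
    st->in_header = 0;
    st->at_line_start = 1;
}

/* Feed one chunk of FASTA text. Can be called repeatedly on consecutive
 * chunks (for example buffers filled by fread), because the state that
 * matters across chunk boundaries is kept in *st. */
void fasta_stats_feed(fasta_stats *st, const char *buf, size_t len)
{
    assert(st != NULL && (buf != NULL || len == 0));
    for (size_t i = 0; i < len; ++i) {
        char c = buf[i];
        if (c == '\n') {                 /* end of line: header ends here too */
            st->in_header = 0;
            st->at_line_start = 1;
            continue;
        }
        if (st->at_line_start && c == '>') {
            st->n_records++;             /* a new record begins */
            st->in_header = 1;
        } else if (!st->in_header && c != '\r' && c != ' ' && c != '\t') {
            st->n_residues++;            /* a sequence character */
        }
        st->at_line_start = 0;
    }
}
```

To use it on a file, call fasta_stats_init once and then fasta_stats_feed on each buffer read with fread until end of file. Because it never copies or stores the sequences, a scanner like this does far less work than a general-purpose parser, which is the trade-off between completeness and speed mentioned above.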
Another option is BioLib, which aims to provide access to C/C++ libraries from BioPerl, BioRuby and BioPython. It may be quicker to use BioLib to improve the performance of parts of the Python code, instead of reimplementing in C/C++.
If you are working with XML data files that are defined by an XML schema, you could use either one of the open source software products
to generate parsing functions that give you a data object model to work with. You also get serialization functions for creating XML files. In addition to that, these two software products create binary formats that you could use if you need efficient storage of your XML files. Parsing and serialization of the binary format is also faster than the XML format.
The parsing and serialization of the XML files can be done in a streaming mode, which is nice if the XML files are big.
One example of what you could do with these two software products is to read Uniprot XML files. I tried out CodeSynthesis XSD/e in streaming mode together with the Uniprot XML Schema:
It worked great and the parsing was really fast.
CodeSynthesis XSD is available in Ubuntu Linux. To install it, run:
root@ubuntu-linux:~# apt-get install xsdcxx