Question: C/C++ Libraries For Bioinformatics?
30
Daniel Swan (10k), Oxford, UK, wrote 3.8 years ago:

This question has come via a former colleague:

"I've done my work in bioinformatics in Python, but for data crunching it's really slow (one of the applications is running for almost a week), so I decided to switch to C. Does anybody know about something like Biopython in C (it seems frustrating to rewrite translation, transcription, database entries parsing from scratch)"

A cursory glance at Google shows there are a lot of unmaintained attempts to create an O|B|F-style set of C libraries for computational biology. Rather than focusing on why Python isn't working for this particular case, does anyone know of actively maintained C/C++ libraries as described?

C
written 3.8 years ago by Daniel Swan (10k)
3

I'm assuming you know your stuff and have tried this, but whenever performance comes up, it's worth mentioning that profiling to identify bottlenecks can be very useful. If there is an obvious rate-limiting step, you may be able to extend python and just write a C function for that one small piece of the puzzle, rather than switching wholesale to C.
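A minimal sketch of what that profiling step could look like with Python's standard `cProfile` and `pstats` modules. The functions `gc_content`, `analyse` and `profile_top` are hypothetical stand-ins for the real pipeline, not anything from the original question:

```python
import cProfile
import io
import pstats


def gc_content(seq):
    """Hypothetical hot spot: fraction of G/C bases in a sequence."""
    return sum(1 for base in seq if base in "GC") / len(seq)


def analyse(seqs):
    """Hypothetical pipeline stage that calls the hot spot for every record."""
    return [gc_content(s) for s in seqs]


def profile_top(func, *args, limit=5):
    """Run func under cProfile; return its result and a text report
    of the `limit` entries with the highest cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, out.getvalue()


result, report = profile_top(analyse, ["ACGT" * 1000] * 200)
print(report)
```

If one function dominates the cumulative time in the report, that is the candidate for a C extension; the rest of the script can stay in Python.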

written 3.8 years ago by Chris Miller (12k)
3

For eliminating bottlenecks you might want to look into scipy/weave, since it enables you to embed C/C++ code directly into your python scripts. See http://www.scipy.org/PerformancePython for details.

written 3.8 years ago by Michael Schubert (5.9k)

If you need to parse structured data, lex/yacc is your friend. For XML use libxml2; for ASN.1 use the NCBI asntool.

written 3.8 years ago by Pierre Lindenbaum (58k)

Thanks Chris, the originator of this question has been pointed to this thread - I'm sure they appreciate the comments about identifying bottlenecks as well.

written 3.8 years ago by Daniel Swan (10k)
26
Phis (1000), CH, wrote 3.8 years ago:

The SeqAn C++ library is quite nice.

written 3.8 years ago by Phis (1000)

I think this might have been the one that was at the back of my mind - I had an inkling this was out there.

written 3.8 years ago by Daniel Swan (10k)

SeqAn is really good and complete. Bowtie and other tools use it.

written 2.8 years ago by Bioquant (150)
15
Pierre Lindenbaum (58k), France, wrote 3.8 years ago:

The only stable library I know is the NCBI toolkit http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/

written 3.8 years ago by Pierre Lindenbaum (58k)
2

But use the C++ version, as the C one is a PITA to compile and to use. Documentation is horrible (or at least was) for the C version.

written 3.8 years ago by Paulo Nuin (3.5k)

Thanks Pierre, I hadn't thought about the NCBI toolkit

written 3.8 years ago by Daniel Swan (10k)
13
brentp (17k), Denver, Colorado, wrote 3.8 years ago:

The GenomeTools library is actively maintained, well-tested, and has visualization tools, parsers, indexes, and a bunch of other tools.

It's also worth having a look at Jim Kent's stuff.

written 3.8 years ago by brentp (17k)
8
Haibao Tang (2.7k), Rockville, MD, wrote 3.8 years ago:

I would like to mention Bio++, which is particularly rich for phylogenetics (substitution models etc.).

Bio++ is a set of C++ libraries for Bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics.

Then there is also bppSuite, built on top of that.

modified 3.8 years ago • written 3.8 years ago by Haibao Tang (2.7k)
6
Marcin Cieslik (490) wrote 3.8 years ago:

I recommend Easel (the library behind HMMER; you can get it with any source code release of HMMER3). I've read (though never used) large portions of its source code. The algorithms are documented, the code is lucid, and it is ANSI C AFAIK, but without knowing what you need I cannot tell whether Easel covers it.

written 3.8 years ago by Marcin Cieslik (490)
6
4
Fiamh (190), Boston, MA, wrote 3.8 years ago:

Can I pitch Curtis Huttenhower's Sleipnir library?

http://huttenhower.sph.harvard.edu/sleipnir/

(Tutorial at http://www.huttenhower.org/content/getting-started-sleipnir)

-- Oliver

written 3.8 years ago by Fiamh (190)
4
Ketil (3.3k) wrote 3.5 years ago:

One of my "selling points" for using Haskell for bioinformatics is that it combines a high level of abstraction with high performance. This makes it very quick to write applications, especially when chores like parsing file formats are already solved in a library, and yet the resulting applications compile to native code and run at speeds comparable (typically within a factor of two) to C.

Perhaps Haskell isn't for everybody, but I think there is a clear need for a language that optimizes programmer time (which is, let's face it, a far more scarce resource than CPU time) without sacrificing too much CPU time.

written 3.5 years ago by Ketil (3.3k)

Is there a Haskell bioinformatics library?

written 3.5 years ago by Erik Garrison (1.1k)

Ah, there is: http://blog.malde.org/index.php/the-haskell-bioinformatics-library/ :)

written 3.5 years ago by Erik Garrison (1.1k)

Yes there is that :-) There are actually several now, I'm working with the author of some of the others (dealing with RNA structure etc) to integrate them.

written 3.5 years ago by Ketil (3.3k)

There are actually several now, and I'm working with the author of some of the other stuff (dealing with RNA structure) to integrate them. See the "bioinformatics" section on HackageDB: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:bioinformatics

written 3.5 years ago by Ketil (3.3k)
4
Casey Bergman (15k), Manchester, UK, wrote 3.5 years ago:

Kevin Thornton's libsequence "is a C++ library designed to aid writing applications for genomics and evolutionary genetics. A large amount of the library is dedicated to the analysis of "single nucleotide polymorphism", or SNP data. The library is intended to be viewed as a "BioC++" akin to the bioperl project, although ... libsequence tries not to re-invent the wheel. Rather, the focus is on biological computation, such as the analysis of SNP data and sequence divergence, and the analysis of data generated from coalescent simulation."

EDIT 16 July 2011:

GeCo++ (Genomic computation C++ library) is "a C++ class library to the purpose of making easier and faster the efficient implementation of algorithms for sequence analysis when functional annotations and genomic variations need to be considered."

modified 2.8 years ago • written 3.5 years ago by Casey Bergman (15k)
4
lh3 (20k) wrote 3.5 years ago:

It really depends on what you need, which you have not clarified. For writing algorithms, SeqAn is the best, but it is not for format parsing. For that purpose, there are no good C/C++ libraries. Java is usually better than scripts when speed is a concern.

EDIT: Maybe I can advertise my own work:

kseq.h: The most efficient and versatile fasta/fastq parser.

ksw: SSE2 Smith-Waterman. Probably only SWPS3 can achieve a similar speed.

knhx: Light-weight New Hampshire parser (not thoroughly tested though)

khmm: Basic HMM library.

These are all single-file or two-file libraries. If you need a component, copying one or two source files is enough. No need to worry about dependencies. These are also the most efficient among libraries with similar functionality (e.g. kseq.h is 10X faster than the fastx_toolkit parser; ksw is 30X faster than a non-SSE2 implementation).

As to the speed of parsing: in another post, we discussed that many bio* components are very inefficient for the sake of completeness. Writing your own can be far faster. The key reason we do not write parsers in C is that, for typical parsing, C is not much faster than a script. Maybe twice as fast, but who cares?
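To illustrate how little code "writing your own" can take in a scripting language, here is a rough sketch of a hand-rolled FASTA reader in Python. It is not kseq.h or any bio* parser; the name `read_fasta` and the sample records are made up for the example, and it only handles plain, well-formed FASTA text:

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream.

    The name is the first whitespace-delimited word after '>'.
    Lines before the first header and blank lines are ignored.
    """
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split(None, 1)
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Made-up sample input for demonstration only.
demo = io.StringIO(">seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(demo))
print(records)
```

Because it is a generator, records are produced one at a time, so memory use stays bounded by the largest single sequence rather than the whole file.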

modified 2.2 years ago • written 3.5 years ago by lh3 (20k)

"Java is usually better when speed is a concern." Do you find Java to be faster than C?

written 3.1 years ago by Aaronquinlan (7.4k)

I actually meant that Java is faster than scripting languages. Sorry for the confusion.

written 3.1 years ago by lh3 (20k)

In addition to SWPS3, there is also diagonalsw, which achieves a similar speed. A drawback of SWPS3 is that its algorithm is buggy: it sometimes gives the wrong result. For details see diagonalsw.sourceforge.net/#swps3

written 2.1 years ago by Erik Sjölund (0)

Yes, SWPS3 is buggy, but I cannot recommend diagonalsw. I wasted half an hour and yet still could not compile it. Its dependency is really unnecessary.

written 2.1 years ago by lh3 (20k)
1
Fede (10) wrote 2.8 years ago:

GeCo++, as suggested by Casey, can do the job. And not just because I'm currently working on it (at the medea institute) ;)

written 2.8 years ago by Fede (10)

Hi Fede, welcome to BioStar. Comments like this should be added using the "add comment" function underneath an answer/question. If you can please make this edit, then I can delete this "answer".

written 2.8 years ago by Casey Bergman (15k)
1
Hamish (2.4k), UK, wrote 2.2 years ago:

As well as being a very useful set of tools, EMBOSS is also a C-based framework for developing sequence analysis applications; see the EMBOSS Developers Guide.

Another option is BioLib, which aims to provide access to C/C++ libraries from BioPerl, BioRuby and BioPython. It may be quicker to use BioLib to improve the performance of parts of the Python code, instead of reimplementing in C/C++.

written 2.2 years ago by Hamish (2.4k)
0
Erik Sjölund (0) wrote 2.1 years ago:

If you are working with XML data files that are defined by an XML schema, you could use either one of the open source software products

CodeSynthesis XSD/e

and

CodeSynthesis XSD

to generate parsing functions that give you a data object model to work with. You also get serialization functions for creating XML files. In addition to that, these two software products create binary formats that you could use if you need efficient storage of your XML files. Parsing and serialization of the binary format is also faster than the XML format.

The parsing and serialization of the XML files can be done in a streaming mode which is nice if the XML files are big.

One example of what you could do with these two software products is to read Uniprot XML files. I tried out CodeSynthesis XSD/e in streaming mode together with the Uniprot XML Schema:

http://www.uniprot.org/docs/uniprot.xsd

It worked great and the parsing was really fast.

CodeSynthesis XSD is available in Ubuntu Linux. To install it, run

root@ubuntu-linux:~# apt-get install xsdcxx
modified 2.1 years ago • written 2.1 years ago by Erik Sjölund (0)