C/C++ Libraries For Bioinformatics?
14
35
12.1 years ago
User 59 13k

This question has come via a former colleague:

"I've done my work in bioinformatics in Python, but for data crunching it's really slow (one of the applications is running for almost a week), so I decided to switch to C. Does anybody know about something like Biopython in C (it seems frustrating to rewrite translation, transcription, database entries parsing from scratch)"

A cursory glance at Google shows there are a lot of unmaintained attempts to create an O|B|F-style set of C libraries for computational biology. Rather than focusing on why Python isn't working for this particular case, does anyone know of actively maintained C/C++ libraries as described?

c c • 26k views
4

For eliminating bottlenecks you might want to look into scipy/weave, since it enables you to embed C/C++ code directly into your python scripts. See http://www.scipy.org/PerformancePython for details.

3

I'm assuming you know your stuff and have tried this, but whenever performance comes up, it's worth mentioning that profiling to identify bottlenecks can be very useful. If there is an obvious rate-limiting step, you may be able to extend Python and just write a C function for that one small piece of the puzzle, rather than switching wholesale to C.
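
To make the "write a C function for that one small piece" idea concrete, here is a minimal sketch. It assumes, purely for illustration, that profiling showed reverse-complementing to be the bottleneck; the function name `revcomp_inplace` and its signature are hypothetical. The `extern "C"` linkage keeps the symbol name unmangled, so a shared library built from this file could be loaded from Python, for example with ctypes.

```cpp
#include <cassert>
#include <cstddef>

// Complement of a single nucleotide; anything unrecognised becomes 'N'.
static char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Hypothetical hot-spot routine (name and signature are illustrative only).
// Reverse-complements seq[0..len) in place.
extern "C" void revcomp_inplace(char *seq, std::size_t len) {
    assert(seq != nullptr || len == 0);
    // Walk inwards from both ends, complementing and swapping each pair.
    for (std::size_t i = 0; i < len / 2; ++i) {
        const char left = complement(seq[i]);
        seq[i] = complement(seq[len - 1 - i]);
        seq[len - 1 - i] = left;
    }
    // An odd-length sequence has a middle base that is only complemented.
    if (len % 2 == 1) {
        seq[len / 2] = complement(seq[len / 2]);
    }
}
```

Built as a shared library (for instance with `g++ -O2 -shared -fPIC`), only this routine lives in native code and the rest of the pipeline stays in Python.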

0

If you need to parse structured data, lex/yacc is your friend. For XML use libxml2; for ASN.1 use the NCBI asntool.

0

Thanks Chris, the originator of this question has been pointed to this thread - I'm sure they appreciate the comments about identifying bottlenecks as well.

29
12.1 years ago
Phis ★ 1.1k

The SeqAn C++ library is quite nice.

0

I think this might have been the one that was at the back of my mind - I had an inkling this was out there.

0

SeqAn is really good and complete. Bowtie and other tools use it.

15
12.1 years ago

The only stable library I know of is the NCBI toolkit.

4

But use the C++ version, as the C one is a PITA to compile and to use. Documentation is horrible (or at least was) for the C version.

14
12.1 years ago
brentp 24k

The genometools library is actively maintained, well-tested, has visualization tools, parsers, indexes, and a bunch of other tools.

It's also worth having a look at Jim Kent's stuff.

9
12.1 years ago

I would like to mention Bio++, which is particularly rich in phylogenetics features (substitution models etc.).

Bio++ is a set of C++ libraries for Bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics.

Then there is also bppSuite, built on top of that.

6
12.1 years ago

I recommend Easel (the library behind HMMER; you can get it with any source code release of HMMER3). I've read (though never used) large portions of its source code. The algorithms are documented, the code is lucid, and it is ANSI C AFAIK, but without knowing what you need I cannot tell whether Easel provides it.

6
11.8 years ago
Rm 8.2k
4
12.1 years ago
Fiamh ▴ 220

Can I pitch Curtis Huttenhower's Sleipnir library? (Tutorial here)

-- Oliver

4
11.8 years ago
Ketil 4.1k

One of my "selling points" for using Haskell for bioinformatics is that it combines a high level of abstraction with high performance. This makes it very quick to write applications, especially when chores like parsing file formats are already solved in a library, and yet the resulting applications compile to native code and run at speeds comparable (typically within a factor of two) to C.

Perhaps Haskell isn't for everybody, but I think there is a clear need for a language that optimizes programmer time (which is, let's face it, a far more scarce resource than CPU time) without sacrificing too much CPU time.

0

Is there a Haskell bioinformatics library?

0

Yes there is that :-) There are actually several now, I'm working with the author of some of the others (dealing with RNA structure etc) to integrate them.

0

There are actually several now, and I'm working with the author of some of the other stuff (dealing with RNA structure) to integrate them. See the "bioinformatics" section on HackageDB

4
11.8 years ago

Kevin Thornton's libsequence is

a C++ library designed to aid writing applications for genomics and evolutionary genetics. A large amount of the library is dedicated to the analysis of "single nucleotide polymorphism", or SNP data. The library is intended to be viewed as a "BioC++" akin to the bioperl project, although ... libsequence tries not to re-invent the wheel. Rather, the focus is on biological computation, such as the analysis of SNP data and sequence divergence, and the analysis of data generated from coalescent simulation.

EDIT 16 July 2011:

GeCo++ (Genomic computation C++ library) is "a C++ class library to the purpose of making easier and faster the efficient implementation of algorithms for sequence analysis when functional annotations and genomic variations need to be considered."

4
11.8 years ago
lh3 33k

It really depends on what you need, which you have not clarified. For writing algorithms, SeqAn is the best, but it is not meant for format parsing. For that purpose, there are no good C/C++ libraries. Java is usually better than scripts when speed is a concern.

EDIT: Maybe I can advertise my own work:

• kseq.h: The most efficient and versatile fasta/fastq parser.
• ksw: SSE2 Smith-Waterman. Probably only SWPS3 can achieve a similar speed.
• knhx: Light-weight New Hampshire parser (not thoroughly tested though)
• khmm: Basic HMM library.

These are all single-file or two-file libraries. If you need a component, copying one or two source files is enough, with no dependencies to worry about. They are also the most efficient among libraries with similar functionality (e.g. kseq.h is 10X faster than the fastx_toolkit parser; ksw is 30X faster than a non-SSE2 implementation).
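
For reference, the kind of non-SSE2 implementation that ksw is compared against can be sketched in a few lines. This is a plain scalar Smith-Waterman that returns only the best local score; it is not ksw's API, and the function name `sw_score`, the linear gap penalty, and the default scores are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Scalar Smith-Waterman local alignment score with a linear gap penalty.
// Hypothetical helper for illustration; scores are assumed defaults.
int sw_score(const std::string &a, const std::string &b,
             int match = 2, int mismatch = -1, int gap = -2) {
    assert(match > 0 && mismatch <= 0 && gap <= 0);
    // Only two rows of the dynamic-programming matrix are kept in memory.
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = 0;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            const int up = prev[j] + gap;      // gap in b
            const int left = cur[j - 1] + gap; // gap in a
            // Local alignment: a cell never drops below zero.
            cur[j] = std::max({0, diag, up, left});
            best = std::max(best, cur[j]);
        }
        std::swap(prev, cur);
    }
    return best;
}
```

The inner loop handles one cell at a time, which is exactly the work a SIMD implementation spreads across vector lanes.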

As to the speed of parsing: in another post, we discussed that many bio* components are very inefficient for the sake of completeness. Writing your own can be far faster. The key reason we do not write parsers in C is that, for typical parsing, C is not much faster than a script. Maybe twice as fast, but who cares?
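
To give an idea of how little code "writing your own" can take, here is a minimal FASTA reader sketch. It is not kseq.h and makes no attempt at its speed or FASTQ support; `FastaRecord` and `read_fasta` are hypothetical names, and the whole input is assumed to fit in memory.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One FASTA entry: the header line without '>' and the joined sequence lines.
struct FastaRecord {
    std::string name;
    std::string seq;
};

// Reads every record from a stream. Illustrative only: no validation,
// and all records are held in memory at once.
std::vector<FastaRecord> read_fasta(std::istream &in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back(FastaRecord{line.substr(1), std::string()});
        } else if (!records.empty()) {
            // Sequence lines before the first header are ignored.
            records.back().seq += line;
        }
    }
    return records;
}
```

Passing a `std::ifstream` instead of a string stream reads from a file in the same way.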

0

Java is usually better when speed is a concern.

Do you find Java to be faster than C?

0

I actually meant that Java is faster than scripting languages. Sorry for the confusion.

0

In addition to SWPS3, there is also diagonalsw, which achieves a similar speed. A drawback of SWPS3 is that its algorithm is buggy: it sometimes gives the wrong result. For details see here

0

Yes, SWPS3 is buggy, but I cannot recommend diagonalsw. I wasted half an hour and yet still could not compile it. Its dependency is really unnecessary.

1
11.1 years ago
Fede ▴ 10

GeCo++, as suggested by Casey, can do the job. And not just because I'm currently working on it (at the medea institute) ;)

1
10.5 years ago
Hamish ★ 3.2k

As well as being a very useful set of tools, EMBOSS is also a C-based framework for developing sequence analysis applications; see the EMBOSS Developers Guide.

Another option is BioLib, which aims to provide access to C/C++ libraries from BioPerl, BioRuby and BioPython. It may be quicker to use BioLib to improve the performance of parts of the Python code, instead of reimplementing in C/C++.

0
10.4 years ago

If you are working with XML data files that are defined by an XML schema, you could use either of these open source software products

CodeSynthesis XSD/e

and

CodeSynthesis XSD

to generate parsing functions that give you a data object model to work with. You also get serialization functions for creating XML files. In addition to that, these two software products create binary formats that you could use if you need efficient storage of your XML files. Parsing and serialization of the binary format is also faster than the XML format.

The parsing and serialization of the XML files can be done in a streaming mode which is nice if the XML files are big.

One example of what you could do with these two software products is to read Uniprot XML files. I tried out CodeSynthesis XSD/e in streaming mode together with the Uniprot XML Schema.

It worked great and the parsing was really fast.

CodeSynthesis XSD is available in Ubuntu Linux. To install it, run

root@ubuntu-linux:~# apt-get install xsdcxx
0
3.3 years ago

On CBioInfCpp.h as a C++ lib containing some functions for bioinformatics

Dear Sirs.

Though I am not a professional programmer, bioinformatics is a very interesting interdisciplinary field for me.

As I see it, Python is the "standard language" in this field.

But when I solved problems at rosalind.info, I used C++. As a result, a "lib of some functions" was born.

The lib contains 3 groups of functions. The first is input-output functions (to read or write vectors, matrices, and graphs from or to a file with a single command, as in Python).

The second group is "Working with strings". It contains functions ranging from computing GC content and Edit Distance to finding all mutated strings in a given one.
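
For readers unfamiliar with the two string functions named above, a minimal sketch of each follows. These are not the CBioInfCpp.h implementations (its function names and signatures are not shown here); `gc_content` and `edit_distance` are hypothetical names, and the edit distance is the standard Levenshtein distance with unit costs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Fraction of G and C bases in a sequence (0.0 for an empty string).
double gc_content(const std::string &s) {
    if (s.empty()) return 0.0;
    std::size_t gc = 0;
    for (char c : s) {
        if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++gc;
    }
    return static_cast<double>(gc) / static_cast<double>(s.size());
}

// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions turning a into b. Uses two rows of the table.
std::size_t edit_distance(const std::string &a, const std::string &b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // deletion
                               cur[j - 1] + 1,       // insertion
                               prev[j - 1] + cost}); // match or substitution
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```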

The third is "Working with graphs". A data structure called "Adjacency vector" is suggested. In the general case, vertices may be assigned negative integers, and graphs may have multiple loops and edges. Functions such as Eulerian Cycle, Path finding, topological sorting, etc. are implemented.

Might it be useful for some tasks?

I understand that this lib lacks the great majority of features. For example, it cannot currently work with bioinformatics databases, but I cannot implement that by myself.

Freely distributed source code and info are here: https://drive.google.com/open?id=1FQwsQm2kG_nTO45ab0yj52xtp6_B4IB2

My profile at Rosalind info http://rosalind.info/users/chernouhov/

Best regards, Chernouhov Sergey

23/06/2019 update:

• The group of functions "FindIn" has been updated.
• The functions PairVectorCout and PairVectorFout have been updated.
• The groups of functions "GraphCout" and "GraphFout" have been added. One may now "cout/fout" a graph that is set by an Adjacency vector to the screen or to a file line by line, one edge per line.
• The function "StrToCircular" has been added for finding the minimal-length circular string of a given string.
• The group of functions "MaxFlowGraph" has been added to help find the maximal flow, the paths of the maximal flow network, and the max-flow min-cut in a graph.
• A data structure "Adjacency map" (a modification of the graph data structure "Adjacency vector") has been added. An Adjacency map allows quicker access to an edge’s weight, but it can’t work with multiple edges.
• The function TandemRepeatsFinding has been added. It is intended for finding tandem repeats in a given string, which may be useful for solving problems related to Microsatellite Instability etc.

14.07.2019 update:

• The function CIGAR1 has been added.
• The groups of functions "GraphCout" and "GraphFout" have been updated (one may now "cout/fout" a graph that is set by either an Adjacency vector or an Adjacency map to the screen or to a file line by line, one edge per line).
• The function EditDistA, an extended version of the function EditDist, has been added (it returns not only the value of the Edit Distance between 2 strings but also one possible version of the alignment itself).

09.08.2019 update:

• The group of functions "NBPaths" (for finding maximal branching paths in a graph, weighted or unweighted, directed or undirected) has been added.
• The functions ConsStringQ1 and ConsStringQ2, for building a consensus string from a given collection of strings according to their quality, have been added. Note that, because little test data was available, errors may be found here (please let me know if you find any).

31.08.2019 update:

• The function GenRandomUWGraph, which generates a random unweighted graph (as its "Adjacency vector"), has been added.
• A group of functions has been added for finding the collection of vertices of each strongly connected component of a directed graph and of each connected component of an undirected graph.
• A group of functions has been added for counting edge multiplicity in a graph that is set by an Adjacency vector.

19.10.2019:

• The groups of functions GraphCout and GraphFout have been updated to deal with mega-maps.

03.11.2019

• The group of functions Num has been updated.
• The function ScoreStringMatrix, which computes the score (i.e. the total number of mismatches) over a vector of strings, has been added.
• The function GPPM, which generates a position probability matrix (PPM), has been added. Note that pseudocounts may be used (the formula (Ns+z)/(N+2*z) is implemented).
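
As an illustration of the pseudocount formula quoted above, here is a minimal sketch that applies (Ns+z)/(N+2*z) literally, where Ns is the count of a symbol in a column, N is the number of strings, and z is the pseudocount. It is not the GPPM implementation; the function name `position_probability_matrix`, the A/C/G/T column layout, and the handling of unknown characters are assumptions. Note that with four symbols a column sums to (N+4z)/(N+2z), so with this denominator the column sums to 1 only when z = 0.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One PPM column: values for A, C, G, T (assumed layout).
using Column = std::array<double, 4>;

// Hypothetical helper: builds a PPM from equal-length strings using the
// quoted formula (Ns + z) / (N + 2*z). Characters other than A, C, G, T
// are not counted.
std::vector<Column> position_probability_matrix(
        const std::vector<std::string> &seqs, double z = 0.0) {
    std::vector<Column> ppm;
    if (seqs.empty()) return ppm;
    const std::size_t width = seqs.front().size();
    const double n = static_cast<double>(seqs.size());
    ppm.assign(width, Column{{0.0, 0.0, 0.0, 0.0}});
    // First pass: count each symbol per column.
    for (const std::string &s : seqs) {
        assert(s.size() == width);
        for (std::size_t j = 0; j < width; ++j) {
            switch (s[j]) {
                case 'A': ppm[j][0] += 1.0; break;
                case 'C': ppm[j][1] += 1.0; break;
                case 'G': ppm[j][2] += 1.0; break;
                case 'T': ppm[j][3] += 1.0; break;
                default: break;
            }
        }
    }
    // Second pass: turn counts into values with the quoted formula.
    for (Column &col : ppm) {
        for (double &p : col) p = (p + z) / (n + 2.0 * z);
    }
    return ppm;
}
```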

26.11.2019

For further updates please see here: A: CBioInfCpp.h as a C++ lib containing some functions for bioinformatics

0

But: GitHub is a new experience for me, so I probably DO make some mistakes there. And why is GitHub the only trusted place? Are we free, but only there?

I DO declare that I DO NOT clearly understand everything about GitHub, so nowadays I use it only for file hosting, since it is such a popular place.