Question

Largest Bioinformatics Software Project?

3

14.3 years ago

Tomer Altman ▴ 40

I'm trying to figure out which bioinformatics software project is the largest, and how much effort goes into its development. It would be great if people could post what they know about software projects that they work on or use. The three main attributes that I am trying to figure out for each project are:

How many lines-of-code does the software consist of?
What are the primary languages that the software is written in?
How many full-time employees (FTEs) are devoted to the development of the software?

Usage statistics for the software (i.e., size of the user community) would also be great.

I'm interested in both open-source and proprietary software.

Thanks!

software comparison • 8.6k views

ADD COMMENT • link updated 14.3 years ago by lh3 33k • written 14.3 years ago by Tomer Altman ▴ 40

4

Comparing the number of lines in projects is no indication of a project's size. For example, languages that require braces inflate the line count significantly. Furthermore, whitespace and comments can also inflate the count.
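
To make the inflation concrete, here is a minimal sketch (not how cloc.pl or any other counter actually works) that splits a raw line count for C-like source into blank, comment, brace-only, and remaining code lines. The function name and the sample snippet are made up for illustration, and the comment handling is deliberately naive: it only recognizes `//` and `/* ... */` at the start of a line.

```python
def classify_lines(source: str) -> dict:
    """Tally raw, blank, comment, brace-only, and remaining code lines."""
    counts = {"raw": 0, "blank": 0, "comment": 0, "brace_only": 0, "code": 0}
    in_block_comment = False
    for line in source.splitlines():
        counts["raw"] += 1
        stripped = line.strip()
        if in_block_comment:
            # Still inside /* ... */; the comment ends on the line containing */.
            counts["comment"] += 1
            if "*/" in stripped:
                in_block_comment = False
        elif not stripped:
            counts["blank"] += 1
        elif stripped.startswith("//"):
            counts["comment"] += 1
        elif stripped.startswith("/*"):
            counts["comment"] += 1
            # Stay in comment mode unless the block closes on the same line.
            if "*/" not in stripped[2:]:
                in_block_comment = True
        elif stripped in ("{", "}", "};"):
            counts["brace_only"] += 1
        else:
            counts["code"] += 1
    return counts


# Hypothetical sample input, only here to exercise the function.
SAMPLE = "\n".join([
    "/* add two numbers */",
    "int add(int a, int b)",
    "{",
    "    // trivial",
    "",
    "    return a + b;",
    "}",
])

if __name__ == "__main__":
    print(classify_lines(SAMPLE))
```

On this sample, `wc -l` style counting would see 7 lines, of which only 2 are statements in the sense used here.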

ADD REPLY • link 14.3 years ago by Gww ★ 2.7k

2

An interesting though controversial question. For people who want to post answers, I would recommend using the same program to count lines of code. `wc -l` seems too primitive. I would recommend cloc.pl (a single Perl script): http://sourceforge.net/projects/cloc/files/cloc/v1.53/

ADD REPLY • link 14.3 years ago by lh3 33k

2

I don't think that the number of lines of code is a good measure of the amount of effort that went into a project. Writing compact, efficient, reusable code takes orders of magnitude more time than writing bloated, inefficient code with lots of code duplication due to bad design.

ADD REPLY • link 14.3 years ago by Lars Juhl Jensen 11k

0

I disagree. Whilst both of you are right that LOC is an ambiguous metric, it is still a fantastic estimate of a project's size when looking at the order of magnitude. Obviously, a project with 1k LOC is significantly smaller than a 1m LOC project -- no matter how you factor in braces, comments, and generous use of blank lines.

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

0

I entirely agree LOC is not a perfect measurement (actually all my projects will be underestimated by LOC), but it is at least a measurement and frequently not so misleading. How can we prove a LIMS is the largest project without measuring it?

ADD REPLY • link 14.3 years ago by lh3 33k

0

Using LOC is just fine for T-shirt sizing (S, M, L) of software projects. But it implies that you have access to either a published LOC stat or the actual source code. Could you infer the size of projects based on published results or some other published metric?

What if you did this:

1. Do a Google search to get a list of bioinformatics software.
2. Create a Google Mashup to auto-search each of the titles and record the hit count.
3. Use the Google hit count to at least infer the popularity of the software.
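
A rough sketch of the ranking step, assuming you already have some way to look up a hit count per project name. The `hit_count` callable is a hypothetical placeholder that you would supply yourself; no real search API is implied here.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_hits(
    names: Iterable[str],
    hit_count: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Return (name, hits) pairs sorted from most to fewest hits.

    `hit_count` is a caller-supplied function (hypothetical here) that
    returns the number of search hits for one project name.
    """
    # Look up each name once, then sort by descending hit count.
    # Ties are broken alphabetically so the order is reproducible.
    scored = [(name, hit_count(name)) for name in names]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Hit counts are noisy (project names that are also common words will score too high), so the result is at best a coarse popularity ordering.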

ADD REPLY • link 14.3 years ago by Ben Lange ▴ 210

0

Can some admin close & purge this question + answers? Apparently it has become a discussion about LOC, and no one really addresses Tomer's points about language usage and full-time employees.

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

0

Please do not close this question. The comments here are all about LOC, but the answers are not.

ADD REPLY • link 14.3 years ago by lh3 33k

0

Thanks to everyone for your energetic replies. I think you have all won me over to biostar!

My coworker recommended the following LoC tool: http://www.dwheeler.com/sloccount/

This is the one of Linux kernel fame. I'm curious as to how it fares against cloc.pl & other tools.

I'm with everyone regarding LoC not being the end-all of software complexity/size/feature metrics, but it's a useful if imperfect one. @Ben, thanks for the recommended approach. While there might be some noise in that approach, I'll definitely add a column for measures of the user community.

ADD REPLY • link 14.3 years ago by Tomer Altman ▴ 40

0

cloc.pl uses source code from SLOCCount. I believe cloc.pl borrows ideas from SLOCCount as well.

ADD REPLY • link 14.3 years ago by lh3 33k

Ram · Answer 1 · 2011-03-30

11

14.3 years ago

lh3 33k

The following statistics come from Ohloh or from a cloc.pl count.

#Project      Language      Code   Comment     Blank     Date/Ver   Source     FTEs
Bioclipse    Java       578,095   349,515   154,338   04/02/2011    Ohloh     ?
Bioconductor R/C/C++  1,248,634   276,358   218,222   03/30/2011    cloc+awk  ?
BioJava      Java       272,864   129,237    59,074   03/30/2011    Ohloh     ?
BioMart      Java/Perl   98,637    43,231    24,346   03/30/2011    Ohloh     ?
BioPerl      Perl       323,007   258,987   167,907   03/30/2011    Ohloh     ?
BioPython    Python     120,824    39,085    22,183   03/30/2011    Ohloh     ?
BioRuby      Ruby        68,390    27,032    15,636   03/30/2011    Ohloh     ?
EMBOSS       C          633,014   258,265   215,110   04/02/2011    Ohloh     ?
flystockdb   JS/Ruby      7,845         ?         ?   ?             ?         1
JKsrc        C          827,908   111,490   105,524   03/31/2011    Ohloh     ?
Jmol         Java       213,645    58,930    28,784   03/30/2011    Ohloh     ?
ncbi_cxx     C++/C    1,112,817   318,441   250,134   Jun_15_2010   cloc.pl   ?
OpenMS       C++        219,835    77,201    51,512   04/02/2011    Ohloh     ?
SeqAn        C++/C      250,390    89,885    55,212   03/30/2011    Ohloh     ?
SHOGUN       C++/C      128,232    53,367    33,488   04/02/2011    Ohloh     ?

A few caveats apply to this table. As the others have argued, these numbers are not a good indication of how large a project is. They just give you a very rough idea.
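
One way to see the caveat in the numbers themselves is to compute what fraction of each project's physical lines is code. The sketch below uses three rows copied from the table; the dictionary and function names are only illustrative.

```python
# (code, comment, blank) line counts copied from the table above.
TABLE_ROWS = {
    "BioPerl": (323_007, 258_987, 167_907),
    "JKsrc": (827_908, 111_490, 105_524),
    "ncbi_cxx": (1_112_817, 318_441, 250_134),
}


def code_share(code: int, comment: int, blank: int) -> float:
    """Fraction of all physical lines that are code."""
    return code / (code + comment + blank)


if __name__ == "__main__":
    for project, counts in TABLE_ROWS.items():
        print(f"{project}: {code_share(*counts):.1%} of lines are code")
```

With these rows, BioPerl comes out at roughly 43% code and JKsrc at roughly 79%, so a raw line count would rank the two projects much closer together than their code counts do.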

EDIT 03/31/2011: JKsrc from ohloh, LOCs very similar to cloc.pl results.

EDIT 04/02/2011: Updated EMBOSS with LOCs from Ohloh (I modified its Enlistment list because the old one pointed to its documentation only); added OpenMS (I modified its Enlistment list because the old one included SVN tags and branches, but we should count trunk only); added SHOGUN; added Ensembl to Ohloh, but Ohloh has problems analyzing its repository; updated Bioclipse as Egon has updated its enlistment. Sorry to push this answer up. I just want to keep it updated.

To further demonstrate cloc.pl, I downloaded Jim Kent's source code (jksrc.zip), unzipped it, and counted lines of code with the following command line:

find . -type f | grep -E '\.(c|h|cpp|cc|hpp|hh|java|py|pl|pm|rb|lua|html|htm|js|php|sql)$' > file.list; cloc.pl --list-file=file.list

The output is:

[?]

This jksrc.zip is one of the largest collections of C source code (if not the largest). It is the basis of the UCSC Genome Browser and a lot of other utilities, such as the famous BLAT.
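
For anyone without `find` and `grep` at hand, the file-selection step can be sketched in Python. This roughly mirrors the extension filter in the command line above; it is not part of cloc.pl, the function names are made up, and edge cases (symlinks, files named just `.c`) may be handled differently than by `find -type f`.

```python
import os

# Same extensions as in the grep pattern of the command line above.
SOURCE_EXTENSIONS = frozenset({
    "c", "h", "cpp", "cc", "hpp", "hh", "java", "py", "pl", "pm",
    "rb", "lua", "html", "htm", "js", "php", "sql",
})


def list_source_files(root, extensions=SOURCE_EXTENSIONS):
    """Return sorted paths under `root` whose extension is in `extensions`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext("foo.tar.c") -> ("foo.tar", ".c"); drop the leading dot.
            ext = os.path.splitext(name)[1][1:]
            if ext in extensions:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def write_file_list(root, out_path="file.list"):
    """Write one matching path per line to `out_path`; return how many."""
    paths = list_source_files(root)
    with open(out_path, "w") as handle:
        handle.writelines(path + "\n" for path in paths)
    return len(paths)
```

The resulting file has one path per line, the same layout that the shell redirection into file.list produces.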

Please include FTE estimates when available.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 14.3 years ago by lh3 33k

4

Well, I "use" UCSC genome browser and I have answered the first two points.

ADD REPLY • link 14.3 years ago by lh3 33k

1

cloc.pl seems to be a very nice tool for analyzing projects.

ADD REPLY • link 14.3 years ago by Farhat ★ 2.9k

0

Sorry for the downvote, but the question clearly said "It would be great if people could post what they know about software projects that they work on or use." and then gave three points of particular interest. Only one point is answered here. It is great that cloc.pl counts the loc so accurately though.

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

0

I'm updating Ohloh's 'enlistments' for Bioclipse...

ADD REPLY • link 14.3 years ago by Egon Willighagen 5.4k

0

OK, not all repositories added yet, but the current count on Ohloh is now: 1,192,189

ADD REPLY • link 14.3 years ago by Egon Willighagen 5.4k

0

I've added a column for FTE estimates. I'm not sure if there's a good way to gauge that for open source projects, since there's a distribution of participant commitment levels. Suggestions welcome.

ADD REPLY • link 14.3 years ago by Tomer Altman ▴ 40

0

FTE is very difficult to measure. Most of the large projects have contributors all over the world, and they are frequently not committed to the project full time.

ADD REPLY • link 14.3 years ago by lh3 33k

score 3 · Answer 2 · 2011-03-30

I can make two contributions here:

BioMart, http://www.biomart.org
  103106 loc Java (for i in $(find . -name '*.java') ; do cat "$i" ; done | wc -l)
  27369 loc JavaScript (for i in $(find . -name '*.js') ; do cat "$i" ; done | wc -l)
  both loc determined for BioMart rc5, http://www.biomart.org/rc5_documentation.pdf
  9 team members, full-time, http://www.biomart.org/credits.html
  open-source project, but property of OICR, http://www.oicr.on.ca
  used by a lot of researchers around the globe

flystockdb, https://www.flystockdb.org and http://joachimbaran.wordpress.com/tag/flystockdb
  sorry, but I have not bought a certificate for the https-domain yet
  2887 loc JavaScript pure flystockdb
  +3859 loc JavaScript Gazebo, which is a framework I develop and use for flystockdb
  +1099 loc Ruby Gazebo, ditto
  1 lonely "team" member, spare-time
  open-source project (Simplified BSD-License), repo at: https://github.com/joejimbo/flystockdb
  not officially released yet, screencast demo: http://bergmanlab.smith.man.ac.uk/?p=704

score 3 · Answer 3 · 2011-03-31

3

14.3 years ago

Laura ★ 1.8k

Both major genome browsers, Ensembl and UCSC, are very long-standing and large bioinformatics projects. Ensembl had its 10th birthday last year.

ADD COMMENT • link 14.3 years ago by Laura ★ 1.8k

0

Good point! And in that vein - think of the management of biological data that goes on at NCBI. NCBI is 22 years old; GenBank is 28 years old.

ADD REPLY • link 14.3 years ago by Larry_Parnell 16k

score 2 · Answer 4 · 2011-03-30

2

14.3 years ago

Chris Evelo 10k

If it is about open-source projects that are available online, you could use Ohloh, which is described here. We found it very useful for estimating the value of existing projects (which comes in handy in grant applications to extend them) and also for finding license conflicts between the main project and the libraries used (although, to be honest, we have not found a good way to deal with those when they are found).

ADD COMMENT • link 14.3 years ago by Chris Evelo 10k

1

Interesting website, though I find it suspect that it contains very outdated information about BioMart (it has been OICR+EBI for years now, not EBI+CSHL), and its claim that BioMart has "2" users is an understatement.

ADD REPLY • link 14.3 years ago by Joachim ★ 2.9k

1

The license conflicts are nasty. Larger projects have their own licenses and are built on software libraries that also have their own. Some of those underlying licenses are viral, meaning that if you use the code, the license propagates automatically. At that stage you have two licenses for the same code, and they may conflict. Ohloh finds these conflicts in your code base, but of course leaves it up to you how to deal with them.

ADD REPLY • link 14.3 years ago by Chris Evelo 10k

0

Thanks for your comment. I think you meant to write Ohloh, which I am familiar with and impressed with. Could you say more about what you mean by "comes in handy" for grant applications, and "find license conflicts"?

ADD REPLY • link 14.3 years ago by Tomer Altman ▴ 40

0

Yes, you are right. Sorry for the typo. The links were correct though. In grant applications you want to do something new. For us that could be something like the integration of miRNAs in pathways, like Larry asked about recently. Now it helps to show that you will incorporate something new into something that already exists and already has 20 man-years of work in it. So the funding agency essentially gets 20 years of development for free.

ADD REPLY • link 14.3 years ago by Chris Evelo 10k

0

I really hate GPL for this reason. BTW, when Debian decides to include a project, it will check the license very carefully.

ADD REPLY • link 14.3 years ago by lh3 33k

score 1 · Answer 5 · 2011-03-30

1

14.3 years ago

Larry_Parnell 16k

My experience indicates that LIMS (Laboratory Information Management Systems) are quite large - in terms of hours spent discussing features and writing code and money spent. Well, time = money, right? In a company where I once worked, we were a bioinformatics team of about 14 on average. I recall that there were 6 to 8 people working on LIMS. That was for a genomics-oriented biotech company. If you carry such efforts to management of patient data in a hospital setting, I would expect the situation to be even grander in scope and cost.

ADD COMMENT • link 14.3 years ago by Larry_Parnell 16k

0

I think you are quite likely right that the biggest software projects will be data management systems of various sorts. The answer to the original question thus boils down to "that depends on your definition of bioinformatics".

ADD REPLY • link 14.3 years ago by Lars Juhl Jensen 11k

0

Do you have a very rough idea how many lines of code? 1 million or 10 million?

ADD REPLY • link 14.3 years ago by lh3 33k

0

No idea (lines of code) - as I was in the analysis group and not in LIMS. If the LIMS is substantially or partly engineered to use or process biological data, then it certainly can sit under a bioinformatics umbrella. Indeed, there was some distinction with our computational group between developers (mostly LIMS) and the analysis group (also wrote code, but also interpreted data).

ADD REPLY • link 14.3 years ago by Larry_Parnell 16k

0

Does anybody know of particular LIMS platforms that are distributed, whether open source or commercial?

ADD REPLY • link 14.3 years ago by Tomer Altman ▴ 40

score 0 · Answer 6 · 2011-03-30

0

14.3 years ago

Pierre Lindenbaum 166k

My 2 cents for the NCBI C toolkit (wc prints lines, words, and bytes):

find ./ -type f| egrep  '\.(c|cpp|h|hh|1|xml|java)$'  | grep -v '/doc/' | xargs cat | wc
1824761 5916928 54595812

ADD COMMENT • link 14.3 years ago by Pierre Lindenbaum 166k

0

Is this distinct from the NCBI C++ toolkit reported above? Do you know if the NCBI C toolkit & NCBI C++ toolkits overlap one another?

ADD REPLY • link 14.3 years ago by Tomer Altman ▴ 40