Question: Largest Bioinformatics Software Project?
3
Tomer Altman wrote:

I'm trying to figure out which bioinformatics software project is the largest, and how much effort goes into its development. It would be great if people could post what they know about software projects that they work on or use. The three main attributes that I am trying to figure out for each project are:

  • How many lines-of-code does the software consist of?
  • What are the primary languages that the software is written in?
  • How many full-time employees (FTEs) are devoted to the development of the software?

Usage statistics for the software (e.g., the size of its user community) would also be great.

I'm interested in both open-source and proprietary software.

Thanks!

software comparison • 3.6k views
ADD COMMENTlink modified 9.4 years ago by lh332k • written 9.4 years ago by Tomer Altman40
4

The number of lines in a project is no indication of the project's size. For example, line counts in languages that require braces will be significantly inflated. Furthermore, whitespace and comments can also inflate the count.

ADD REPLYlink written 9.4 years ago by Gww2.7k
2

An interesting though controversial question. For people who want to post answers, I would recommend using the same program to count lines of code. `wc -l` seems too primitive. I would recommend cloc.pl (a single Perl script): http://sourceforge.net/projects/cloc/files/cloc/v1.53/
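
To illustrate the gap between `wc -l` and a comment-aware count, here is a minimal sketch. The helper name `count_code_lines` and the sample file are made up for this example; the filter only recognizes blank lines and `//` comment lines, and it is not how cloc.pl works internally.

```shell
# Hypothetical helper: count lines that are neither blank nor pure // comments.
# A crude approximation for illustration only; cloc.pl handles far more cases.
count_code_lines() {
    grep -Ev '^[[:space:]]*(//.*)?$' "$1" | wc -l | tr -d ' '
}

# Build a small throwaway C file to compare the two counts.
sample=$(mktemp)
printf '%s\n' \
    '// add two numbers' \
    'int add(int a, int b)' \
    '{' \
    '    return a + b;' \
    '}' \
    '' > "$sample"

raw=$(wc -l < "$sample" | tr -d ' ')   # every line, including comment and blank
code=$(count_code_lines "$sample")     # comment and blank lines filtered out
echo "wc -l: $raw, code lines: $code"

rm -f "$sample"
```

Note that the lines holding only `{` or `}` still count as code here, which is the brace inflation mentioned in the comment above.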

ADD REPLYlink written 9.4 years ago by lh332k
2

I don't think that the number of lines of code is a good measure of the amount of effort that went into a project. Writing compact, efficient, reusable code takes orders of magnitude more time than writing bloated, inefficient code with lots of code duplication due to bad design.

ADD REPLYlink written 9.4 years ago by Lars Juhl Jensen11k

I disagree. Whilst both of you are right that LOC is an ambiguous metric, it is still a fantastic estimate of a project's size when looking at the order of magnitude. Obviously, a project with 1k LOC is significantly smaller than a 1m LOC project -- no matter how you factor in braces, comments, and generous use of blank lines.

ADD REPLYlink written 9.4 years ago by Joachim2.9k

I entirely agree LOC is not a perfect measurement (actually all my projects will be underestimated by LOC), but it is at least a measurement and frequently not so misleading. How can we prove a LIMS is the largest project without measuring it?

ADD REPLYlink written 9.4 years ago by lh332k

Using LOC is just fine for T-shirt sizing (S, M, L) of software projects. But it implies that you have access to either a published LOC stat or the actual source code. Could you infer the size of projects based on published results or some other published metric?

What if you did this:

  1. Do a Google search to get a list of bioinformatics software.
  2. Create a Google Mashup to search each of the titles automatically and record the hit count.
  3. Use the Google hit count to at least infer the popularity of the software.

ADD REPLYlink written 9.4 years ago by Ben Lange190

Can some admin close & purge this question + answers? Apparently it has become a discussion about LOC, and no one really addresses Tomer's points about language usage and full-time employees.

ADD REPLYlink written 9.4 years ago by Joachim2.9k

Please do not close this question. The comments here are all about LOC, but the answers are not.

ADD REPLYlink written 9.4 years ago by lh332k

Thanks to everyone for your energetic replies. I think you have all won me over to biostar!

My coworker recommended the following LoC tool: http://www.dwheeler.com/sloccount/

This is the one of Linux kernel fame. I'm curious as to how it fares against cloc.pl & other tools.

I'm with everyone regarding LoC not being the end-all of software complexity/size/feature metrics, but it's a useful if imperfect one. @Ben, thanks for the recommended approach. While there might be some noise in that approach, I'll definitely add a column for measures of the user community.

ADD REPLYlink written 9.4 years ago by Tomer Altman40

cloc.pl uses source code from SLOCCount. I believe cloc.pl learns from SLOCCount.

ADD REPLYlink written 9.4 years ago by lh332k
11
lh3 (United States) wrote:

The following statistics come from Ohloh or from a cloc.pl count.


#Project      Language      Code   Comment     Blank     Date/Ver   Source     FTEs
Bioclipse    Java       578,095   349,515   154,338   04/02/2011    Ohloh     ?
Bioconductor R/C/C++  1,248,634   276,358   218,222   03/30/2011    cloc+awk  ?
BioJava      Java       272,864   129,237    59,074   03/30/2011    Ohloh     ?
BioMart      Java/Perl   98,637    43,231    24,346   03/30/2011    Ohloh     ?
BioPerl      Perl       323,007   258,987   167,907   03/30/2011    Ohloh     ?
BioPython    Python     120,824    39,085    22,183   03/30/2011    Ohloh     ?
BioRuby      Ruby        68,390    27,032    15,636   03/30/2011    Ohloh     ?
EMBOSS       C          633,014   258,265   215,110   04/02/2011    Ohloh     ?
flystockdb   JS/Ruby      7,845         ?         ?   ?             ?         1
JKsrc        C          827,908   111,490   105,524   03/31/2011    Ohloh     ?
Jmol         Java       213,645    58,930    28,784   03/30/2011    Ohloh     ?
ncbi_cxx     C++/C    1,112,817   318,441   250,134   Jun_15_2010   cloc.pl   ?
OpenMS       C++        219,835    77,201    51,512   04/02/2011    Ohloh     ?
SeqAn        C++/C      250,390    89,885    55,212   03/30/2011    Ohloh     ?
SHOGUN       C++/C      128,232    53,367    33,488   04/02/2011    Ohloh     ?

A few caveats apply to this table. As others have argued, these numbers are not a good indication of how large a project is; they just give you a very rough idea.
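
One way to see why the raw totals are only a rough guide is to compare comment density across rows. The sketch below takes three rows copied from the table and prints comment lines per 100 lines of code; the choice of rows and of this ratio is only for illustration, not part of how the table was produced.

```shell
# Three rows copied from the table above: project, code, comment, blank.
rows='BioPerl 323,007 258,987 167,907
JKsrc 827,908 111,490 105,524
ncbi_cxx 1,112,817 318,441 250,134'

# Strip the thousands separators, then print comment lines per 100 code lines.
ratios=$(printf '%s\n' "$rows" | tr -d ',' |
    awk '{ printf "%s %.0f\n", $1, 100 * $3 / $2 }')
echo "$ratios"
```

On these rows BioPerl comes out at roughly 80 comment lines per 100 code lines, against roughly 13 for JKsrc, so a ranking by total lines would differ from a ranking by code lines alone.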

EDIT 03/31/2011: JKsrc from ohloh, LOCs very similar to cloc.pl results.

EDIT 04/02/2011: Updated EMBOSS with LOCs from Ohloh (I modified its Enlistment list because the old one pointed to its documentation only); added OpenMS (I modified its Enlistment list because the old one included SVN tags and branches, but we should count trunk only); added SHOGUN; added Ensembl to Ohloh, but Ohloh has problems analyzing its repository; updated Bioclipse, as Egon has updated its enlistment. Sorry to push this answer up; I just want to keep it updated.

To further demonstrate cloc.pl, I downloaded Jim Kent's source code (jksrc.zip), unzipped it, and counted lines of code with the following command line:

find . -type f | egrep '\.(c|h|cpp|cc|hpp|hh|java|py|pl|pm|rb|lua|html|htm|js|php|sql)$' > file.list; cloc.pl --list-file=file.list

The output is:

[?]

This jksrc.zip is one of the largest collections of C source code (if not the largest). It is the base of the UCSC genome browser and a lot of other utilities such as the famous BLAT.
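
The same kind of file list can also be built with find's own `-name` tests instead of a regular expression. A minimal sketch, assuming only three of the extensions from the command above; `list_sources` and the throwaway directory are hypothetical names used for the demonstration:

```shell
# Hypothetical helper: list source files under a directory by extension,
# using find's -name tests rather than piping through egrep.
list_sources() {
    find "$1" -type f \( -name '*.c' -o -name '*.h' -o -name '*.py' \) | sort
}

# Throwaway tree to try it on: two source files and one document.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/doc"
touch "$demo/src/main.c" "$demo/src/util.h" "$demo/doc/notes.txt"

list_sources "$demo" > "$demo/file.list"
n_files=$(wc -l < "$demo/file.list" | tr -d ' ')
echo "$n_files source files listed"

rm -rf "$demo"
```

On a real source tree, the resulting file.list would be passed to cloc.pl with `--list-file=file.list`, as in the command above.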

Please include FTE estimates when available.

ADD COMMENTlink modified 10 months ago by RamRS28k • written 9.4 years ago by lh332k
4

Well, I "use" UCSC genome browser and I have answered the first two points.

ADD REPLYlink written 9.4 years ago by lh332k
1

cloc.pl seems to be a very nice tool for analyzing projects.

ADD REPLYlink written 9.4 years ago by Farhat2.9k

Sorry for the downvote, but the question clearly said "It would be great if people could post what they know about software projects that they work on or use." and then gave three points of particular interest. Only one point is answered here. It is great that cloc.pl counts the loc so accurately though.

ADD REPLYlink written 9.4 years ago by Joachim2.9k

I'm updating Ohloh's 'enlistments' for Bioclipse...

ADD REPLYlink written 9.4 years ago by Egon Willighagen5.2k

OK, not all repositories added yet, but the current count on Ohloh is now: 1,192,189

ADD REPLYlink written 9.4 years ago by Egon Willighagen5.2k

I've added a column for FTE estimates. I'm not sure if there's a good way to gauge that for open source projects, since there's a distribution of participant commitment levels. Suggestions welcome.

ADD REPLYlink written 9.4 years ago by Tomer Altman40

FTE is very difficult to measure. Most of the large projects have contributors all over the world, and they are frequently not committed to the project full time.

ADD REPLYlink written 9.4 years ago by lh332k
3
Joachim (San Francisco, California) wrote:

I can make two contributions here:

ADD COMMENTlink written 9.4 years ago by Joachim2.9k

Thanks for this, Joachim. I'll also add the flystockdb to the table above.

ADD REPLYlink written 9.4 years ago by Tomer Altman40

Thanks in return. :)

ADD REPLYlink written 9.4 years ago by Joachim2.9k
3
Laura (Cambridge UK) wrote:

Both major genome browsers, Ensembl and UCSC, are very long-standing and large bioinformatics projects. Ensembl had its 10th birthday last year.

ADD COMMENTlink written 9.4 years ago by Laura1.7k

Good point! And in that vein - think of the management of biological data that goes on at NCBI. NCBI is 22 years old; GenBank is 28 years old.

ADD REPLYlink written 9.4 years ago by Larry_Parnell16k
2
Chris Evelo (Maastricht, The Netherlands) wrote:

If it is about open source projects that are available online, you could use Ohloh, which is described here. We found it very useful to estimate the value of existing projects (which comes in handy in grant applications to extend them) and also to find license conflicts between the main project and the libraries it uses. (Although, to be honest, we have not found a good way to deal with those conflicts once they are found.)

ADD COMMENTlink modified 9.4 years ago • written 9.4 years ago by Chris Evelo10k
1

Interesting web-site, even though I find it suspect that it contains very outdated information about BioMart (it has been OICR+EBI for years now, not EBI+CSHL), and its claim that BioMart has "2" users is an understatement.

ADD REPLYlink written 9.4 years ago by Joachim2.9k
1

The license conflicts are nasty. Larger projects have their own licenses and are built on software libraries that also have their own. Some of those underlying licenses are viral, meaning that if you use the code, the license propagates automatically. At that stage you have two licenses for the same code, and they may conflict. Ohloh finds these conflicts in your code base, but of course leaves it up to you how to deal with them.

ADD REPLYlink written 9.4 years ago by Chris Evelo10k

Thanks for your comment. I think you meant to write Ohloh, which I am familiar with and impressed with. Could you say more about what you mean by "comes in handy" for grant applications, and "find license conflicts"?

ADD REPLYlink written 9.4 years ago by Tomer Altman40

Yes, you are right. Sorry for the typo. The links were correct though. In grant applications you want to do something new. For us that could be something like the integration of miRNAs in pathways, as Larry asked about recently. It helps to show that you will incorporate something new into something that already exists and already has 20 man-years of work in it. So the funding agency essentially gets 20 years of development for free.

ADD REPLYlink written 9.4 years ago by Chris Evelo10k

I really hate GPL for this reason. BTW, when Debian decides to include a project, it will check the license very carefully.

ADD REPLYlink written 9.4 years ago by lh332k
1
Larry_Parnell (Boston, MA USA) wrote:

My experience indicates that LIMS (Laboratory Information Management Systems) are quite large - in terms of hours spent discussing features and writing code and money spent. Well, time = money, right? In a company where I once worked, we were a bioinformatics team of about 14 on average. I recall that there were 6 to 8 people working on LIMS. That was for a genomics-oriented biotech company. If you carry such efforts to management of patient data in a hospital setting, I would expect the situation to be even grander in scope and cost.

ADD COMMENTlink written 9.4 years ago by Larry_Parnell16k

I think you are quite likely right that the biggest software projects will be data management systems of various sorts. The answer to the original question thus boils down to "that depends on your definition of bioinformatics".

ADD REPLYlink written 9.4 years ago by Lars Juhl Jensen11k

Do you have a very rough idea of how many lines of code? 1 million or 10 million?

ADD REPLYlink written 9.4 years ago by lh332k

No idea (lines of code) - as I was in the analysis group and not in LIMS. If the LIMS is substantially or partly engineered to use or process biological data, then it certainly can sit under a bioinformatics umbrella. Indeed, there was some distinction with our computational group between developers (mostly LIMS) and the analysis group (also wrote code, but also interpreted data).

ADD REPLYlink written 9.4 years ago by Larry_Parnell16k

Does anybody know of particular LIMS platforms that are distributed, whether open source or commercial?

ADD REPLYlink written 9.4 years ago by Tomer Altman40
0
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

My 2 cents for the NCBI C toolkit:

find ./ -type f| egrep  '\.(c|cpp|h|hh|1|xml|java)$'  | grep -v '/doc/' | xargs cat | wc
1824761 5916928 54595812
ADD COMMENTlink written 9.4 years ago by Pierre Lindenbaum129k

Is this distinct from the NCBI C++ toolkit reported above? Do you know if the NCBI C toolkit & NCBI C++ toolkits overlap one another?

ADD REPLYlink written 9.4 years ago by Tomer Altman40