Question: Largest Bioinformatics Software Project?
3
8.5 years ago by
Tomer Altman40
Tomer Altman40 wrote:

I'm trying to figure out which bioinformatics software project is the largest, and how much effort goes into its development. It would be great if people could post what they know about software projects that they work on or use. The three main attributes that I am trying to figure out for each project are:

  • How many lines-of-code does the software consist of?
  • What are the primary languages that the software is written in?
  • How many full-time employees (FTEs) are devoted to the development of the software?

Usage statistics for each project (i.e., the size of its user community) would also be great.

I'm interested in both open-source and proprietary software.

Thanks!

software comparison • 3.4k views
ADD COMMENTlink modified 8.5 years ago by lh331k • written 8.5 years ago by Tomer Altman40
4

Comparing the number of lines in projects is no indication of a project's size. For example, languages that require braces will inflate the line count significantly. Furthermore, whitespace and comments can also inflate the count.

ADD REPLYlink written 8.5 years ago by Gww2.7k
2

An interesting though controversial question. For people who want to post answers, I would recommend using the same program to count lines of code. `wc -l` seems too primitive. I would recommend cloc.pl (a single Perl script): http://sourceforge.net/projects/cloc/files/cloc/v1.53/
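To illustrate why a raw `wc -l` count differs from a code-only count, here is a minimal Python sketch that splits lines into code, comment-only, and blank. It is not how cloc.pl works internally; the function name `classify_lines` and the single `#` comment prefix are assumptions made for the example, and real counters also handle block comments and many languages.

```python
def classify_lines(text, comment_prefix="#"):
    """Count code, comment-only, and blank lines in a source string."""
    counts = {"code": 0, "comment": 0, "blank": 0}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comment"] += 1
        else:
            # A line with code plus a trailing comment is counted as code.
            counts["code"] += 1
    return counts


sample = """# compute a sum
total = 0

for x in [1, 2, 3]:
    total += x  # trailing comment, still a code line
"""

counts = classify_lines(sample)
# The sum of all three categories is what a plain line count would report.
print(counts, "total lines:", sum(counts.values()))
```

The gap between the total and the `code` entry is the inflation from comments and blank lines that the comments above are worried about.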

ADD REPLYlink written 8.5 years ago by lh331k
2

I don't think that the number of lines of code is a good measure of the amount of effort that went into a project. Writing compact, efficient, reusable code takes orders of magnitude more time than writing bloated, inefficient code with lots of code duplication due to bad design.

ADD REPLYlink written 8.5 years ago by Lars Juhl Jensen11k

I disagree. Whilst both of you are right that LOC is an ambiguous metric, it is still a fantastic estimate of a project's size when looking at the order of magnitude. Obviously, a project with 1k LOC is significantly smaller than a 1m LOC project -- no matter how you factor in braces, comments, and generous use of blank lines.

ADD REPLYlink written 8.5 years ago by Joachim2.8k

I entirely agree LOC is not a perfect measurement (actually all my projects will be underestimated by LOC), but it is at least a measurement and frequently not so misleading. How can we prove a LIMS is the largest project without measuring it?

ADD REPLYlink written 8.5 years ago by lh331k

Using LOC is just fine for T-shirt sizing (S, M, L) of software projects. But it implies that you have access to either a published LOC stat or the actual source code. Could you infer the size of projects based on published results or some other published metric?

What if you did this: 1. Do a Google search to get a list of bioinformatics software. 2. Create a Google Mashup to automatically search for each of the titles and record the hit count. 3. Use the number of Google hits to at least infer the popularity of the software.

ADD REPLYlink written 8.5 years ago by Ben Lange190

Can some admin close & purge this question + answers? Apparently it has become a discussion about LOC, and no one really addresses Tomer's points about language usage and full-time employees.

ADD REPLYlink written 8.5 years ago by Joachim2.8k

Please do not close this question. The comments here are all about LOC, but the answers are not.

ADD REPLYlink written 8.5 years ago by lh331k

Thanks to everyone for your energetic replies. I think you have all won me over to biostar!

My coworker recommended the following LoC tool: http://www.dwheeler.com/sloccount/

This is the one of Linux kernel fame. I'm curious as to how it fares against cloc.pl & other tools.
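As far as I know, SLOCCount also prints a development-effort estimate next to its line counts, based on the basic COCOMO model. The sketch below shows that formula in its default "organic" form, effort = 2.4 * KSLOC^1.05 person-months. The function name is made up for illustration, and the result is a model estimate derived from line counts, not a measurement of actual FTEs.

```python
def basic_cocomo_person_months(sloc, a=2.4, b=1.05):
    """Basic COCOMO (organic mode): effort in person-months = a * KSLOC ** b."""
    ksloc = sloc / 1000.0
    return a * ksloc ** b


# Compare the two orders of magnitude discussed in this thread.
for sloc in (1_000, 1_000_000):
    months = basic_cocomo_person_months(sloc)
    print(f"{sloc:>9,} SLOC: {months:,.1f} person-months "
          f"({months / 12:,.1f} person-years)")
```

Because the exponent is slightly above 1, the estimated effort grows a little faster than the line count, which is the model's way of saying that large code bases cost more per line.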

I'm with everyone regarding LoC not being the end-all of software complexity/size/feature metrics, but it's a useful if imperfect one. @Ben, thanks for the recommended approach. While there might be some noise in that approach, I'll definitely add a column for measures of the user community.

ADD REPLYlink written 8.5 years ago by Tomer Altman40

cloc.pl uses source code from SLOCCount. I believe cloc.pl learns from SLOCCount.

ADD REPLYlink written 8.5 years ago by lh331k
11
8.5 years ago by
lh331k
United States
lh331k wrote:

The following statistics come from Ohloh or from cloc.pl counts.


Project       Language   Code       Comment  Blank    Date/Ver     Source    FTEs
Bioclipse     Java       578,095    349,515  154,338  04/02/2011   Ohloh     ?
Bioconductor  R/C/C++    1,248,634  276,358  218,222  03/30/2011   cloc+awk  ?
BioJava       Java       272,864    129,237  59,074   03/30/2011   Ohloh     ?
BioMart       Java/Perl  98,637     43,231   24,346   03/30/2011   Ohloh     ?
BioPerl       Perl       323,007    258,987  167,907  03/30/2011   Ohloh     ?
BioPython     Python     120,824    39,085   22,183   03/30/2011   Ohloh     ?
BioRuby       Ruby       68,390     27,032   15,636   03/30/2011   Ohloh     ?
EMBOSS        C          633,014    258,265  215,110  04/02/2011   Ohloh     ?
flystockdb    JS/Ruby    7,845      ?        ?        ?            ?         1
JKsrc         C          827,908    111,490  105,524  03/31/2011   Ohloh     ?
Jmol          Java       213,645    58,930   28,784   03/30/2011   Ohloh     ?
ncbi_cxx      C++/C      1,112,817  318,441  250,134  Jun_15_2010  cloc.pl   ?
OpenMS        C++        219,835    77,201   51,512   04/02/2011   Ohloh     ?
SeqAn         C++/C      250,390    89,885   55,212   03/30/2011   Ohloh     ?
SHOGUN        C++/C      128,232    53,367   33,488   04/02/2011   Ohloh     ?



A few caveats apply to this table. As others have argued, these numbers are not a good indication of how large a project is; they only give a very rough idea.
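Taking the table's figures at face value, the ranking by code lines and the share of comment lines can be worked out with a short Python sketch. The numbers are copied from the table; flystockdb is left out because its comment and blank counts are unknown. The variable names are only illustrative.

```python
# (project, code lines, comment lines, blank lines), copied from the table.
# flystockdb is omitted because only its code count is known.
rows = [
    ("Bioclipse", 578_095, 349_515, 154_338),
    ("Bioconductor", 1_248_634, 276_358, 218_222),
    ("BioJava", 272_864, 129_237, 59_074),
    ("BioMart", 98_637, 43_231, 24_346),
    ("BioPerl", 323_007, 258_987, 167_907),
    ("BioPython", 120_824, 39_085, 22_183),
    ("BioRuby", 68_390, 27_032, 15_636),
    ("EMBOSS", 633_014, 258_265, 215_110),
    ("JKsrc", 827_908, 111_490, 105_524),
    ("Jmol", 213_645, 58_930, 28_784),
    ("ncbi_cxx", 1_112_817, 318_441, 250_134),
    ("OpenMS", 219_835, 77_201, 51_512),
    ("SeqAn", 250_390, 89_885, 55_212),
    ("SHOGUN", 128_232, 53_367, 33_488),
]

# Rank projects by code lines, largest first.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

for name, code, comment, blank in ranked:
    # Share of non-blank lines that are comments.
    comment_share = comment / (code + comment)
    print(f"{name:<13}{code:>10,}  comments: {comment_share:.0%}")
```

The comment share varies a lot between projects, which is one reason to compare the Code column rather than total line counts.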

EDIT 03/31/2011: JKsrc from ohloh, LOCs very similar to cloc.pl results.

EDIT 04/02/2011: Updated EMBOSS with LOCs from Ohloh (I modified its Enlistment list because the old one pointed to its documentation only); added OpenMS (I modified its Enlistment list because the old one included SVN tags and branches, but we should count trunk only); added SHOGUN; added Ensembl to Ohloh, but Ohloh has problems analyzing its repository; updated Bioclipse, as Egon has updated its enlistment. Sorry to push this answer up. I just want to keep it updated.

To further demonstrate cloc.pl, I downloaded Jim Kent's source code (jksrc.zip), unzipped it, and counted lines of code with the following command line:

find . -type f | grep -E "\.(c|h|cpp|cc|hpp|hh|java|py|pl|pm|rb|lua|html|htm|js|php|sql)$" > file.list; cloc.pl --list-file=file.list

The output is:

[?]

This jksrc.zip is one of the largest collections of C source code (if not the largest). It is the basis of the UCSC genome browser and of many other utilities, such as the famous BLAT.

Please include FTE estimates when available.

ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by lh331k
4

Well, I "use" UCSC genome browser and I have answered the first two points.

ADD REPLYlink written 8.5 years ago by lh331k
1

cloc.pl seems to be a very nice tool for analyzing projects.

ADD REPLYlink written 8.5 years ago by Farhat2.9k

Sorry for the downvote, but the question clearly said "It would be great if people could post what they know about software projects that they work on or use." and then gave three points of particular interest. Only one point is answered here. It is great that cloc.pl counts the loc so accurately though.

ADD REPLYlink written 8.5 years ago by Joachim2.8k

I'm updating Ohloh's 'enlistments' for Bioclipse...

ADD REPLYlink written 8.5 years ago by Egon Willighagen5.2k

OK, not all repositories added yet, but the current count on Ohloh is now: 1,192,189

ADD REPLYlink written 8.5 years ago by Egon Willighagen5.2k

I've added a column for FTE estimates. I'm not sure if there's a good way to gauge that for open source projects, since there's a distribution of participant commitment levels. Suggestions welcome.

ADD REPLYlink written 8.5 years ago by Tomer Altman40

FTE is very difficult to measure. Most of the large projects have contributors all over the world, and they are frequently not committed to the project full time.

ADD REPLYlink written 8.5 years ago by lh331k
3
8.5 years ago by
Joachim2.8k
San Francisco, California
Joachim2.8k wrote:

I can make two contributions here:

ADD COMMENTlink written 8.5 years ago by Joachim2.8k

Thanks for this, Joachim. I'll also add the flystockdb to the table above.

ADD REPLYlink written 8.5 years ago by Tomer Altman40

Thanks in return. :)

ADD REPLYlink written 8.5 years ago by Joachim2.8k
3
8.5 years ago by
Laura1.7k
Cambridge UK
Laura1.7k wrote:

Both major genome browsers, Ensembl and UCSC, are long-standing and large bioinformatics projects. Ensembl had its 10th birthday last year.

ADD COMMENTlink written 8.5 years ago by Laura1.7k

Good point! And in that vein - think of the management of biological data that goes on at NCBI. NCBI is 22 years old; GenBank is 28 years old.

ADD REPLYlink written 8.5 years ago by Larry_Parnell16k
2
8.5 years ago by
Chris Evelo10.0k
Maastricht, The Netherlands
Chris Evelo10.0k wrote:

If it is about open-source projects that are available online, you could use Ohloh, which is described here. We found it very useful for estimating the value of existing projects (which comes in handy in grant applications to extend them) and for finding license conflicts between the main project and the libraries it uses, although to be honest we have not found a good way to deal with those conflicts once they are found.

ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Chris Evelo10.0k
1

Interesting website, even though I find it suspect that it contains very outdated information about BioMart (it has been OICR+EBI for years now, not EBI+CSHL), and its claim that BioMart has "2" users is an understatement.

ADD REPLYlink written 8.5 years ago by Joachim2.8k
1

The license conflicts are nasty. Larger projects have their own licenses and are built on software libraries that also have their own. Some of those underlying licenses are viral, meaning that if you use the code, the license propagates automatically. At that stage you have two licenses for the same code, and they may conflict. Ohloh finds these conflicts in your code base, but of course leaves it up to you how to deal with them.

ADD REPLYlink written 8.5 years ago by Chris Evelo10.0k

Thanks for your comment. I think you meant to write Ohloh, which I am familiar with and impressed by. Could you describe in more detail what you mean by "comes in handy" for grant applications, and "find license conflicts"?

ADD REPLYlink written 8.5 years ago by Tomer Altman40

Yes, you are right. Sorry for the typo. The links were correct though. In grant applications you want to do something new. For us that could be something like the integration of miRNAs in pathways, as Larry asked about recently. It helps to show that you will incorporate something new into something that already exists and already has 20 man-years of work in it. So the funding agency essentially gets 20 years of development for free.

ADD REPLYlink written 8.5 years ago by Chris Evelo10.0k

I really hate GPL for this reason. BTW, when Debian decides to include a project, it will check the license very carefully.

ADD REPLYlink written 8.5 years ago by lh331k
1
8.5 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

My experience indicates that LIMS (Laboratory Information Management Systems) are quite large - in terms of hours spent discussing features and writing code and money spent. Well, time = money, right? In a company where I once worked, we were a bioinformatics team of about 14 on average. I recall that there were 6 to 8 people working on LIMS. That was for a genomics-oriented biotech company. If you carry such efforts to management of patient data in a hospital setting, I would expect the situation to be even grander in scope and cost.

ADD COMMENTlink written 8.5 years ago by Larry_Parnell16k

I think you are quite likely right that the biggest software projects will be data management systems of various sorts. The answer to the original question thus boils down to "that depends on your definition of bioinformatics."

ADD REPLYlink written 8.5 years ago by Lars Juhl Jensen11k

Do you have a very rough idea of how many lines of code? 1 million or 10 million?

ADD REPLYlink written 8.5 years ago by lh331k

No idea (lines of code) - as I was in the analysis group and not in LIMS. If the LIMS is substantially or partly engineered to use or process biological data, then it certainly can sit under a bioinformatics umbrella. Indeed, there was some distinction with our computational group between developers (mostly LIMS) and the analysis group (also wrote code, but also interpreted data).

ADD REPLYlink written 8.5 years ago by Larry_Parnell16k

Does anybody know of particular LIMS platforms that are distributed, whether open source or commercial?

ADD REPLYlink written 8.5 years ago by Tomer Altman40
0
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

My 2 cents for the NCBI C toolkit:

find ./ -type f| egrep  '\.(c|cpp|h|hh|1|xml|java)$'  | grep -v '/doc/' | xargs cat | wc
1824761 5916928 54595812
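For readers without a Unix shell, the line count (the first number in the `wc` output above) can be approximated with a small Python sketch. It mirrors the pipeline's filters, keeping the same extensions and skipping any path that contains `/doc/`, and it counts newline bytes like `wc -l`. The function name `count_lines` is made up for this example, and the sketch has not been run on the toolkit itself.

```python
import os
import re

# Same extension filter as the egrep pattern in the pipeline above.
SOURCE_RE = re.compile(r"\.(c|cpp|h|hh|1|xml|java)$")


def count_lines(root):
    """Sum newline counts over matching files under root, skipping /doc/ paths."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Normalize separators so the /doc/ check also works on Windows.
            if "/doc/" in path.replace(os.sep, "/"):
                continue
            if not SOURCE_RE.search(name):
                continue
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    total += chunk.count(b"\n")
    return total
```

Calling `count_lines(".")` from the top of the unpacked source tree should give a number comparable to the first column of `wc`. Note that this is a raw line count, so comments and blank lines are included.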
ADD COMMENTlink written 8.5 years ago by Pierre Lindenbaum123k

Is this distinct from the NCBI C++ toolkit reported above? Do you know if the NCBI C toolkit & NCBI C++ toolkits overlap one another?

ADD REPLYlink written 8.5 years ago by Tomer Altman40