The top answer in What are the biggest challenges bioinformaticians have with data analysis? motivated me to ask this question. Personally, I have tried numerous software packages and frequently have problems with installation. I have my own answers to this question, but I would rather hear others' opinions first: What is the major obstacle in your experience? Is it biased towards particular programming languages? How do you think we can improve? Is Docker the savior? Do you have examples of easy-to-install tools you wish other tools would follow? I might write a blog post to give my view, drawing on your opinions.
Academic software is hard to install because ease of installation and usability are neither required nor directly rewarded.
When developing commercial software, users can "reward" or "punish" the developers immediately and measurably. For a research paper, all one needs to convince is three reviewers; moreover, those reviewers are usually experts and therefore not representative of the actual audience.
A recent experience. The claim: "our software installs easily and seamlessly via cmake". OK, I have that. It turns out CMake had to be above a certain version, but that newer version only compiles if GCC is above a certain version, which in turn required a complete reinstall of glibc and a whole lot of other tools.
Adding insult to injury, it turns out an older CMake would have worked just as well: nothing in the build used any newer CMake feature. It was simply the version the developer happened to have on their machine, so that was the version that got baked into the requirements.
A company has to sell its software as a product. If the software is not easy to install and nobody uses it, then nobody buys it, and the company fails.
Academic software is usually aimed at the same people who produce it. The software is released more as a way to improve reproducibility than to gain economically from it.
My most frequent annoyance is getting Python programs installed on a cluster where I don't have root permissions to install things like numpy. You end up needing to muck with PYTHONPATH and friends to install everything yourself. It's not difficult, but it's an annoyance (more so when the admin makes you remove things from drives shared across nodes when you're not actually running jobs). I should probably just write a script to automate this sort of thing.
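Something like the rough sketch below is what I have in mind: install into a user-owned prefix with pip, then print the PYTHONPATH line to add to your shell profile. It's untested, the prefix and helper name are just placeholders, and it assumes pip is available on the cluster.

```python
#!/usr/bin/env python
"""Sketch only: install a package into a user-owned prefix (no root needed)
and print the PYTHONPATH export to paste into ~/.bashrc. Assumes pip is
available; "~/local" and the helper name are placeholders."""
import os
import subprocess
import sys

def user_install(package, prefix=os.path.expanduser("~/local")):
    # pip's --prefix puts pure-Python packages under
    # <prefix>/lib/pythonX.Y/site-packages on most Linux systems.
    subprocess.check_call([sys.executable, "-m", "pip", "install",
                           "--prefix", prefix, package])
    site = os.path.join(prefix, "lib",
                        "python%d.%d" % sys.version_info[:2], "site-packages")
    print("export PYTHONPATH=%s:$PYTHONPATH" % site)

if __name__ == "__main__":
    user_install(sys.argv[1])
```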
There are several reasons behind this; I can discuss a few here:
- Most of these tools are written with a Unix background, so you need a good knowledge of Unix to use them.
- The software depends on other libraries. Many of those libraries are themselves difficult to install, and you need to understand the Unix OS to install them.
- The installation procedure is completely different from what Windows users are used to.
- Most of the time you don't have permission to install into system directories, so you are forced to install locally (e.g. on HPC). This is the most difficult task for most new bioinformaticians.
- Most of the software doesn't have a GUI.
When I program for myself:
- Pros: Client knows exactly what he wants, and can convey the problem to me reasonably well. Client is always lenient with deadlines.
- Cons: Client doesn't pay very well...
When I program for the Biotech industry:
- Pros: Client pays really well!
- Cons: Client often wants 'something just like X, but better'. Deadlines are often more important to the client than the product.
When I program for Academics:
- Pros: ???
- Cons: Client offended if you ask for payment. I feel bad for even asking. Client always prefers 'something just like what was used in publication X, but better'. Client doesn't mention deadlines to you, but Client may or may not suddenly publish a program just like yours without notice, at any moment.
Of course, I'm just joking. The real reason academic programs are so hard to install/use is that writing user-friendly code takes almost as much time as writing the novel algorithm in the first place. For the project I just finished/posted last week on Biostars (metaflagstat.py - painless bam/sam read flag counting), the programming time broke down roughly like this:
- 1 day writing the code (in python) that creates the flag-count matrix (the main 'novel' or useful feature of the program).
- 2+ days fiddling with the HTML/CSS to make the plots look nice.
- 1 week cleaning up the above code, annotating it, and swapping out non-core Python packages like "requests" for clunky but core ones like "urllib" (see the sketch after this list)
- 2 days getting all the code hosted on the web, filming the how-to video, and posting about it on here.
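To give an idea of what that swap looks like, here is a minimal Python 3 sketch of the same HTTP GET done both ways. The URL is only a placeholder, and the "requests" variant is left as comments, since the whole point is avoiding the extra dependency:

```python
# Same HTTP GET two ways; the URL is just a placeholder.
# With the third-party "requests" package (nicer API, but an extra install):
#   import requests
#   data = requests.get("https://example.com/api").json()

# With only the standard library (clunkier, but nothing for the user to install):
import json
from urllib.request import urlopen

with urlopen("https://example.com/api") as response:
    data = json.loads(response.read().decode("utf-8"))
```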
So as you can see, I had a functioning program in just 3 days. I then spent over 10 days making it easy for others to use. That's 80% of the time I spent on the tool! :P
So the bottom line is: if you want the programmers who write these programs to make them more user-friendly, you have to accept that they'll probably only produce 20% of the stuff they used to make, because such a large proportion of their time is spent on 'user experience' work like the installer, code annotation, well-written documentation, the user interface, automatic update checking, etc.
Personally, I have never felt this difficulty. I have quite a bit of experience installing locally as well as installing in a user-specific fashion on HPC. Honestly, if you give me a .tgz, or if it is available as a package in apt-get or on homebrew, it doesn't get simpler.
I think the most important things developers should consider are the target OS (and version), the programming language (and version), and the libraries (and their versions) used when building a tool. All of these can be tested thoroughly with continuous integration, but to take full advantage of it we would need tests covering all the methods and statements, and to write good tests we need documentation. So I think it would take a lot of work to get most academic software to the point of using continuous integration, because I can't think of many toolkits that actually have a test suite and documentation (not just a readthedocs page, which is not helpful at the command line or when developing).
Getting there is much easier than you might think, though: every Perl package I build has tests that check syntax and formatting and verify that every method is documented, so I can focus on more important things. Of course, a prerequisite for a system like this is version control, because you can't maintain a script/package on just your own computer and expect it to work in another environment.
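To give a flavour of how cheap such a check can be, here is the "every method is documented" idea translated into a Python sketch (not my actual Perl tests; "mypackage" is a placeholder), written as a pytest-style test:

```python
# Sketch: fail the test suite whenever a public function lacks a docstring.
# "mypackage" is a placeholder; imported helpers would be picked up too,
# which is good enough for a sketch.
import inspect

import mypackage

def test_every_public_function_is_documented():
    for name, func in inspect.getmembers(mypackage, inspect.isfunction):
        if not name.startswith("_"):
            assert inspect.getdoc(func), "%s has no documentation" % name
```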
You asked for examples, so I will share some features of Perl that I wish more people in bioinformatics knew about. Since Perl doesn't consider whitespace significant, you can use tools to pack your dependencies into a single file, so every script/app is dependency-free. This also works with C libraries: you can compile a Perl script and the C libraries it uses into a single binary. There are also some fantastic admin-free tools like perlbrew and cpanminus; both are examples of my previous point, since they can be installed with a single curl command. Therefore, no one should ever have to use "sudo" to do anything with Perl. From a developer's perspective, you can easily manage many versions of Perl with perlbrew or plenv, and use them all to compile a script with a single command (though I use CI to do that for me). With cpanminus, I use a cpanfile (a dependency spec just like a Gemfile for Ruby), so I can pin my installation to ensure one library version doesn't break the build. There are other tools for managing dependencies too, like Carton and Stratopan, which let you ensure that "version 0.07" always ships with the same libraries, not whatever newer ones the package manager decides to install.
I don't have a general solution for shipping core dependencies like graphics libraries; I wish there were one. You would think that people who use these tools in industry have clever methods for updating and deploying them rapidly (maybe Docker, as you said).