This is a very classic question: Which is your favorite programming language in bioinformatics? Which languages would you recommend to a student wishing to enter the world of bioinformatics?
This topic has already been discussed on the Internet, but I think it would be nice to discuss it here. Here are some links to previous polls and discussions:
I think the emphasis should be more on the way we optimize our programs than on the language we use. I personally choose languages based on the kind of problem I am solving.
This is an interesting paper which I came across some time back. Some of the information in it might sound redundant to some of you, but it's still worth a read.
You will also need to be familiar with projects like R and Bioconductor, since a lot of the work will involve providing the computational infrastructure for analyzing data. In addition, you’ll need to know about data formats (FASTA, SBML, mmCIF…), software toolkits and libraries (PAUP, PHYLIP, EMBOSS, BioPerl…), databases (Ensembl, InterPro, PDB, KEGG…), webservers and portals (PubMed, ISCB).
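To make the point about data formats concrete, here is a minimal sketch of reading FASTA records in plain Python. The function name `parse_fasta` and the example records are made up for illustration, and it assumes the simplest form of the format (a `>` header line followed by sequence lines). For real work a toolkit parser such as the ones in BioPerl or Biopython is the safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {identifier: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the identifier.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append it to the record currently being read.
            records[name] += line
    return records


# Hypothetical input, only to show the call.
example = """>seq1 hypothetical example
ACGT
ACGT
>seq2
GGCC
""".splitlines()

records = parse_fasta(example)
```

With the example above, `records` maps `seq1` to `ACGTACGT` and `seq2` to `GGCC`.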
Finally keep in mind best practices (like refraining from reinventing the wheel), but above all, give yourself the time to enjoy the learning process. Getting to the top usually takes longer than staying at the top; so what’s the point if you haven’t enjoyed the trip?
The choice of a programming language is purely subjective, but when a student asks you which programming language he should start with, you have to give an answer, or at least provide some information.
I think that a bioinformatician who studies R and at least two or three libraries (lattice/ggplot2, plyr) early can have an advantage, because he will be able to represent his data properly and obtain good results without too much effort. If your supervisor is not a computer scientist, he will be a lot more impressed by plots and charts than by programs, even if they are well written, with unit tests etc.
Python is a good programming language to learn as a general purpose tool. Its biggest advantages are its easy-to-read syntax and its paradigm 'there is only one way to do it', so the number of language keywords is reduced to the minimum, and two programs with the same function written by different people will be very similar (which is not the case with Perl). The negative point of Python is that its interface for reading CSV files and plotting is not ready yet (the best is pylab), so you must rely on R to produce nice plots.
Honestly I don't like Perl, because I think it can encourage many bad habits in novice programmers. For example, in Perl there are many similar constructs to accomplish the same objective, so it is very difficult to understand a program written by someone else, because you have to know all the possible constructs and hope there are enough comments. It is already very difficult to reproduce a bioinformatics experiment; if you write your code in a difficult language it is a lot worse. Moreover, I know of many people who have been using Perl for years but who don't even use functions, because it looks too complicated. How can that be? It looks very inefficient. The only good point of Perl is its repositories, BioPerl and CPAN; however, I know of people using Perl who don't even know these exist, so I don't understand why they keep going with Perl.
Apart from programming languages, it is very useful to learn the basic usage of GNU make, or of a derivative. This program is very useful when you have a lot of different scripts, as it allows you to define a pipeline in order to run them. Some basic bash commands may also be very useful if you work with a lot of flat files (head, sed, gawk, grep, ...)
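For anyone who has not used make, the core idea can be sketched in a few lines of Python: an output file is rebuilt only when it is missing or older than one of its inputs. This is only an illustration of the principle, not how GNU make is implemented, and the name `is_stale` is made up for the example.

```python
import os
from typing import Sequence


def is_stale(target: str, dependencies: Sequence[str]) -> bool:
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # The target must be rebuilt if any input was modified after it.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

A pipeline step would then run its script only when `is_stale("counts.txt", ["reads.txt"])` is true (both file names are hypothetical). make gives you this check, plus dependency ordering between steps, for free.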
There are different paths to becoming a good programmer in bioinformatics. You need to figure out what suits you best. As most of the answers above focus on one path, I am talking about a different one.
If we ask what programming languages famous computational biologists (e.g. Richard Durbin, Lincoln Stein, Ewan Birney, Jim Kent, Mike Brudno, Sean Eddy, Nick Patterson, Goncalo Abecasis, Gene Myers and more) use, the list is quite small: C/C++ and shell by all, and Perl by most. It is true that 15 years ago we did not have many choices, but this implies that learning even old-school programming languages like C and Perl (just two) is sufficient to make you a good computational biologist. Some believe that knowing many programming languages makes one a good programmer, but this is not necessarily true. To me, what is important is the thinking rather than the knowledge.
Another pattern among these famous people is that they do not rely heavily on libraries. Many, if not all, of them implemented their own libraries from scratch, including hash tables, search trees, string manipulation, sorting, special functions, random number generators, statistical tests, format parsing, basic sequence alignment and more. Doing this may be considered by some as "reinventing the wheel", but to me this is why they are successful. Personally, I do not think one can grasp the essence of programming and master the skills unless (s)he fully comprehends the fundamental elements, which can only be achieved by reimplementing them oneself.
Taking this path requires years of learning and much more effort than learning a scripting language and using libraries, but in the long run, this effort will pay off.
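As a small illustration of what reimplementing a fundamental element looks like, here is a sketch of the score computation for basic global sequence alignment (Needleman-Wunsch with a linear gap penalty). It is written in Python only for brevity. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions for the example, not anyone's production code.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # Row 0: aligning the empty prefix of `a` against b[:j] costs j gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of `b`.
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + pair  # align a[i-1] with b[j-1]
            up = previous[j] + gap             # a[i-1] against a gap
            left = current[j - 1] + gap        # b[j-1] against a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]
```

Only two rows of the dynamic programming table are kept, so memory is linear in the length of `b`. Recovering the alignment itself (not just the score) needs the full table or a divide-and-conquer scheme, which is exactly the kind of detail you only learn by writing it yourself.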
Of course, whether to take this path depends on your own strengths and interests. For the majority, answers by others are more suitable. I am just giving a minor alternative.
EDIT: About Perl and Python. I tried Python 2.6 for a while, but came back to Perl in the end. The most important reason, which is frequently overlooked, is that Python's regex engine is very, very slow. For simple regexes (not for complex ones), it can be 10X slower than Perl. Since I use scripting languages mainly for format parsing, being 10X slower is unacceptable.
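If you want to check how much regex matching costs for your own parsing, a rough sketch with the standard `timeit` module is below. The record, the pattern and the function names are hypothetical. This only compares a compiled regex with a plain `split` inside Python and does not reproduce the Perl comparison above.

```python
import re
import timeit

# Hypothetical tab-separated record; pattern and data are illustrative only.
LINE = "chr1\t12345\trs0001\tA\tG"
FIELD = re.compile(r"^(\S+)\t(\d+)\t")


def parse_with_regex(line: str):
    """Extract (name, position) from the first two fields using a regex."""
    m = FIELD.match(line)
    return (m.group(1), int(m.group(2))) if m else None


def parse_with_split(line: str):
    """Extract the same two fields with a plain string split."""
    fields = line.split("\t")
    return (fields[0], int(fields[1]))


if __name__ == "__main__":
    # Time each parser on the same line; absolute numbers depend on the machine.
    for fn in (parse_with_regex, parse_with_split):
        seconds = timeit.timeit(lambda: fn(LINE), number=100_000)
        print(f"{fn.__name__}: {seconds:.3f} s")
```

Timings vary with the Python version, the pattern and the data, so measure on your real files before drawing conclusions.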
It is important to be considerate and not characterize one particular approach negatively. My favorite quote is:
Programming is pure thought.
Hopefully everyone is able to pick an approach that matches their individual way of thinking. While I myself do not program in Perl, I consider it to be one of the most popular and powerful platforms for doing bioinformatics analysis.
Perl can be quite lovely if you choose to write it well. If you find yourself in need of writing some Perl, I'd highly recommend getting the Perl Best Practices book and going through it to learn how to make your Perl code not suck. Essential tools for helping with that are perlcritic and perltidy, both of which I have bound to quick keystrokes in my emacs cperl-mode so as to make sure my code is in reasonably good shape. There are lots of blog articles out there about writing "Modern Perl" or "Enlightened Perl" that help make the language not just bearable but actually quite nice for a certain type of brain.
One thing that Perl does very well that no other language does is quick text processing on the command line. If you want to do some simple processing of a text file (which is pretty standard in this business), Perl is a fantastic package to do so. Stringing together a set of UNIX utilities on a Linux system will usually have you running for a half dozen manpages looking for conflicting and unique switches, whereas with Perl I find that there's far less I have to remember to get the same effect. The book Minimal Perl goes into this sort of thing in detail (Perl as a better awk/sed/grep/etc) and I highly recommend having a look. At the very least, I've found that using Perl in this fashion filled a hole in my toolkit that I didn't even realize was there. R and Python can, of course, do this sort of thing too, but not nearly so well as Perl.
I think you need different kinds of programming languages for different purposes.
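For comparison, here is roughly what a typical small task of this kind (print one column from the lines that match a pattern) looks like as a Python script instead of a command-line one-liner. The function name and sample lines are made up for the example.

```python
import re
from typing import Iterable, Iterator


def select_column(lines: Iterable[str], pattern: str, column: int) -> Iterator[str]:
    """Yield one whitespace-separated column from lines matching `pattern`."""
    regex = re.compile(pattern)
    for line in lines:
        fields = line.split()
        # Keep the line only if it matches and actually has that column.
        if regex.search(line) and column < len(fields):
            yield fields[column]


# Hypothetical input, only to show the call.
sample = ["gene1 chr1 100", "gene2 chr2 200", "# comment"]
values = list(select_column(sample, r"^gene", 2))
```

Here `values` is `["100", "200"]`. To use it as a filter you would pass `sys.stdin` as `lines` and print each value. It works, but it is clearly more ceremony than a one-liner typed at the prompt.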
The bases are :
You will also have to learn how to use databases with SQL and perhaps some basic things about HTML and CSS.
More importantly, learning a language is easy, but learning how to design efficient algorithms or reusable code is not. Also, as Manuel Corpas said, "reinventing the wheel" is something to avoid, so you will have to know the classic algorithms and classes which are already implemented in public libraries.
I have found useful: Perl, MySQL, Unix commands and shell scripts, R, and knowing some web stuff (HTML/php).
It's good to be familiar with a variety of tools, so you can choose the right one for the problem (and not force a tool to do something it's not really designed for, just because you don't know how to do it any other way).
If I was starting out, I might consider something like ruby or python instead of perl, but maybe not. There's a lot of code out there already written in perl.
I don't have any preconceived notion regarding the best programming language, but rather about approaches and requirements. Bioinformatics programming is an art of expressing scientific fields, so it can differ according to an individual's or organization's needs. Moreover, choosing the best programming language is really about best practices and approaches.
We can consider the best programming language depending on:
Python/Biopython - Easy, efficient and agile methodology
Perl - Perl and Python can't be compared because they both have their own pros and cons. But I will always go with Python: simple yet elegant.
Java/C/C++ - Hard to code and lacks rapid development.
It is good to have a strong background in general programming concepts. The choice of languages depends on the nature of your projects. In general, a mixed bag of programming skills in domains like scripting (take your pick: Perl, Python, Ruby), web (lots of JavaScript, CSS, Perl / PHP), databases (MySQL, PostgreSQL) and statistics (mostly R / Matlab), together with C / C++ / Java, will be an excellent combination.
Most languages can do all of the basic things you want. Trying to argue which language is best for bioinformatics is completely subjective. That being said, there are some points I think are worthwhile to consider.
One thing that's very important is the list of supporting libraries that you will require for your work. Not all programming languages have the same library support. You may need some very specific statistical methods, machine learning libraries or some high quality graphing libraries, etc. Nothing can be more frustrating than using one language and finding out that it lacks bindings for a very useful library (writing them yourself can be a huge pain). It can also be a huge waste of time to have to rewrite that library in your native language.
Nevertheless, once you've started using one language it's very easy to adapt to other languages quickly. Most of them have very similar features; it's just a matter of getting used to differences in their syntax.
Now, I'll give my personal opinion about why I think Python is a good language to start with for bioinformatics. The language itself is much easier to read than Perl; check out the Zen of Python. These are the kinds of ideals I try to follow when I write my code, and it's nice using a language that tries to follow them (a lot of the Python libraries try to follow them as well).
The library support for Python is growing steadily. For example, the scipy, numpy and matplotlib libraries allow you to do a lot of math, stats and graph generation. Everything I needed to do with R for my work I could do using numpy / matplotlib, and the syntax wasn't nearly as frustrating. There is also BioPython, which is doing quite well. As I said at the beginning of the post, this is one of the most important points to me when choosing a language. It's nice not having to reinvent the wheel all of the time.
Finally, it's very easy to replace bottlenecks in your Python code with C (with numpy you can even embed performance-critical C code directly in your Python program) or C++ code.
Just my two cents, and since nobody has mentioned it yet: I'm using MATLAB. Yes, it's commercial and expensive. It might be behind R/Bioconductor in the number of contributed algorithms (this is why I sometimes have to use R as well). But the environment is very friendly for fast development, figures are great, and making GUIs is pretty easy. There are many toolboxes useful for bioinformaticians, like Statistics, Bioinformatics and Optimization. Some may find SimBiology cool (although I haven't used it). As others mentioned, Perl still rules for text processing and workflows, although I agree with giovanni on its problems.
Hi! Briefly, I use R/Bioconductor, combined with PHP and MySQL. PHP can encapsulate programs/scripts from other languages, such as Perl and Java, and can easily manage webapps by URL. It is also nice for UI and runs on Linux/MacOSX/Window$.
I was just going through this article and thought I'd post it here.
Well, just my 50 cents here. In our lab we prefer to write the core of our algorithms in Java, which results in portable, stable and really computationally fast software. Writing in Java is quite a slow process compared to scripting languages, but it's quite easy and you are going to learn best practices fast. Of course, most of the time in bioinformatics you need to write lots of scripts for custom data handling. We have chosen the Groovy scripting language, as it allows native Java module support and is also more computationally efficient compared to Perl and Python. And of course the various text processing options in Groovy are comparable to Perl.
I agree language is mostly a matter of personal taste, but I'd like to vote for Clojure as an excellent language for bioinformatics. For me personally, the combination of Clojure with the Light Table IDE is simply a joy to use. Light Table also supports Python, by the way. I would recommend everyone to try it out for themselves, as I could never do the power of Clojure and LT justice in a few paragraphs on this forum.