OK, after reading over these I'm going to contribute another answer. Now you can legitimately say that if you want x + 1 opinions on bioinformatics pipeline development, ask x bioinformaticians.
What I see is that we all have different notions of what a "pipeline" really is. And that's at the core of the diversity of advice.
So is a pipeline:
- a way to run a file through a series of preexisting tools, perhaps with context-specific parameters, as in a Galaxy workflow?
- a comprehensive analysis package built to address a specific problem, in which most or all of the analysis algorithms are written specifically for that package?
- a quick and dirty way to generate stats for a large group of files?
In my experience, a pipeline is all of the above, and more. My answer was based on my personal experience writing software to fulfill the specific needs at my workplace.
So instead of recommending a specific language, I'm going to ask you to take a step back from language specifics and first ask yourself what, exactly, you want to accomplish. What is your definition of a pipeline? What are the requirements for your pipeline? Some good questions to guide you if you're not sure where to begin:
- What data are you starting with?
- What are the endpoints you want to attain?
- Are there any existing tools which can take you through some or all of these steps?
- How much will the path your data takes through the pipeline vary from one run to the next?
- What can go wrong at each step? How should you handle these exceptions when they occur?
- How, ideally, will you tell the pipeline what to do for a given datafile?
- Do you have access to a compute cluster?
All of these questions will help you pick a strategy and accompanying language.
If you simply want to take a bunch of FASTQ files that were generated the same way and use the same reference, and turn them into sorted BAMs, then you can accomplish that in bash by piping between preexisting tools like BWA (or bowtie or tophat) with very little development time. You can actually do quite a lot with bash piping. Look up posts from resident bash badass umer.zeeshan.ijaz for some inspiration.
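To make that concrete, here's a minimal sketch of the kind of one-liner I mean (the file names, reference, and thread counts are placeholders; it assumes bwa and samtools 1.x are on your PATH):

```bash
# Align paired-end reads and pipe straight into a coordinate sort,
# so no intermediate SAM file ever hits the disk.
bwa mem -t 8 ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -@ 4 -o sample.sorted.bam -
samtools index sample.sorted.bam
```

Wrap that in a for loop over your FASTQ pairs and you already have a serviceable "pipeline" for the simple case.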
Some operations that seem difficult or that ostensibly require several steps can actually be accomplished easily with clever tools and libraries. For example, R's Bioconductor packages (Rsamtools, for one) have some very powerful routines that can help you extract specific data from BAM files.
On the other hand, if you work at a facility or company where you're going to get many different types of data and need an automated way to send those data through one of many possible analysis pathways, you're going to want to build something that tracks the samples over time. That was my conception of a "pipeline," and that's why I chose Python. In my pipelines, I track datafiles as they go through the pipeline and create and dispatch jobs to a compute cluster. Since I'm using preexisting tools at most of the steps, Python's structural limitations aren't a factor for me.
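For flavor, here's a stripped-down sketch of that idea. This is not my actual code; the step names, paths, and the SGE-style qsub submission are all stand-ins for illustration:

```python
import subprocess
from pathlib import Path

# Hypothetical ordered steps and a per-sample record of where each datafile is.
PIPELINE_STEPS = ["align", "sort", "call_variants"]
sample_state = {}  # e.g. {"sample01": "align"}

def submit_job(sample, step, script_dir=Path("jobs")):
    """Write a small shell script for this step and hand it to the cluster."""
    script_dir.mkdir(exist_ok=True)
    script = script_dir / f"{sample}.{step}.sh"
    script.write_text(f"#!/bin/bash\n# commands for '{step}' on {sample} go here\n")
    subprocess.run(["qsub", str(script)], check=True)  # or sbatch, bsub, ...
    sample_state[sample] = step

def advance(sample):
    """Dispatch the next step for a sample, if any steps remain."""
    done = sample_state.get(sample)
    todo = PIPELINE_STEPS if done is None else PIPELINE_STEPS[PIPELINE_STEPS.index(done) + 1:]
    if todo:
        submit_job(sample, todo[0])
```

In the real thing the state would live in a database rather than a dict, and you'd check job exit statuses before advancing, but the shape is the same: track where each file is, and let the cluster do the heavy lifting.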
If you're doing something novel with the data, then you might have to write your own analysis routines rather than guiding and tracking the data through preexisting software. At that point something like Perl or Python can become a limiting factor (in Python's case, the ol' Global Interpreter Lock becomes a major consideration), and C++ and Java start to look more appealing.
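To see what I mean about the GIL: in (C)Python, threads don't run Python bytecode in parallel, so CPU-bound code gains nothing from them and you end up juggling separate processes instead. A toy comparison (the workload is just a stand-in for a real analysis routine):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def crunch(n):
    """Stand-in for a CPU-bound analysis routine."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(crunch, [5_000_000] * 4))
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads")     # roughly serial, thanks to the GIL
    timed(ProcessPoolExecutor, "processes")  # actually uses multiple cores
```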
An ounce of research into preexisting solutions can be worth a pound of code. :)