What is the best programming language for NGS data analysis pipeline development?
13
9
8.4 years ago
jack ▴ 940

I want to know which programming language is best for developing NGS data analysis pipelines.

I program in R, but it is very slow and its memory management is crazy.

RNA-Seq Assembly • 11k views
3

For me it is bash, awk, Python and, indeed, R.

0

+1 on that. I think that bash, awk and R will do most of the job of gluing programs together, etc.

7
8.4 years ago
JC 13k

I would go with anything you can understand and develop deeply: Perl, Python, Bash, Make, ...

5
8.4 years ago
Dan D 7.3k

OK, after reading over these I'm going to contribute another answer. Now you can legitimately say that if you want x + 1 opinions on bioinformatics pipeline development, ask x bioinformaticians.

What I see is that we all have different notions of what a "pipeline" really is. And that's at the core of the diversity of advice.

## So is a pipeline:

• a way to run a file through a series of preexisting tools, perhaps with context-specific parameters, a la a Galaxy workflow?
• a comprehensive analysis package seeking to answer a specific problem, in which most or all of the analysis algorithms are specifically written for said package?
• a quick and dirty way to generate stats for a large group of files?

Yes!

In my experience, a pipeline is all of the above, and more. My answer was based on my personal experience writing software to fulfill the specific needs at my workplace.

So instead of recommending a specific language, I'm going to ask you to take a step back from asking about language specifics and first ask yourself what specifically you want to accomplish. What is your definition of a pipeline? What are the requirements for your pipeline? Some good questions to guide you if you're not sure where to begin:

• What data are you starting with?
• What are the endpoints you want to attain?
• Are there any existing tools which can take you through some or all of these steps?
• How much will the pathways your data take vary between operations of the pipeline?
• What can go wrong at each step? How should you handle these exceptions when they occur?
• How, ideally, will you tell the pipeline what to do for a given datafile?

All of these questions will help you pick a strategy and accompanying language.

If you simply want to take a bunch of FASTQ files that are made in the same way and use the same reference, and get sorted BAMs, then you can accomplish that in Bash by piping between preexisting tools like BWA (or Bowtie or TopHat) with very little development time. You can actually do quite a lot with Bash piping. Look up posts from resident bash badass umer.zeeshan.ijaz for some inspiration.
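For comparison, here is a rough sketch of the same FASTQ-to-sorted-BAM pipe driven from Python's `subprocess` module instead of Bash. The function names and file names are hypothetical, and the `bwa mem` and `samtools sort -o` invocations are just the commonly used forms, so check them against the versions you have installed.

```python
import subprocess


def align_commands(reference, fastq, out_bam):
    """Build the two stages of `bwa mem ref reads | samtools sort -o out.bam -`."""
    return [
        ["bwa", "mem", reference, fastq],
        ["samtools", "sort", "-o", out_bam, "-"],
    ]


def run_piped(commands):
    """Chain commands stdout-to-stdin, like `a | b | c`, without a shell.

    Returns (exit codes of every stage, bytes written by the last stage).
    """
    procs = []
    upstream = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Drop the parent's copy so an early exit downstream reaches upstream.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    return [p.wait() for p in procs], output


# Hypothetical usage; needs bwa, samtools, and an indexed reference:
# codes, _ = run_piped(align_commands("ref.fa", "sample.fastq", "sample.sorted.bam"))
```

Checking every exit code, not only the last one, is roughly what `set -o pipefail` gives you in Bash.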

Some operations that seem difficult or ostensibly require several steps can actually be easily accomplished with clever tools and libraries. For example, R's Bioconductor project has some very powerful routines that can help you extract specific data from BAM files.

On the other hand, if you work at a facility or company where you're going to get many different types of data and need to have an automated way to send those data through one of many possible analysis pathways, you're going to want to build something that tracks the samples over time. That was my conception of a "pipeline" and that's why I chose Python. In my pipelines, I track datafiles as they go through the pipeline, and create and dispatch jobs to a compute cluster. Since I'm using preexisting tools at most of the steps, Python's structural limitations aren't a factor for me.

If you're doing something novel with the data, then you might have to write your own analysis routines rather than guiding and tracking the data through preexisting software. At that point something like Perl or Python becomes a limiting factor (in the case of Python, the ol' Global Interpreter Lock becomes a major consideration, and C++ and Java start to look more appealing).

An ounce of research into preexisting solutions can be worth a pound of code. :)

0

It was a great recommendation, thanks.

4
8.4 years ago
Dan D 7.3k

I've written Linux-based cluster pipelines in Bash, Perl, and Python. Of those, I prefer Python. If you want database connectivity, the ease of starting command line processes, along with the power of basic OOP, then Python is a good comprehensive choice for a sophisticated pipeline.

Consider that with a pipeline, you're likely chaining together existing tools rather than writing analysis programs. Thus your main tasks are going to be assessing what tasks need to be performed, how those tasks should be performed, and then executing and tracking those tasks. You can start out quick and dirty with Perl and get things done quickly by writing static tracking and flag files which can be written and parsed to move jobs along. Python is almost as easy here, too.
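The flag-file idea can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the `.done` suffix, the `run_once` name, and the one-flag-per-step layout are all made up here.

```python
from pathlib import Path


def run_once(step_name, action, flag_dir):
    """Run action() unless a flag file says this step already finished.

    Returns True if the step ran now, False if it was skipped.
    """
    flag = Path(flag_dir) / f"{step_name}.done"
    if flag.exists():
        return False  # completed on an earlier run
    action()
    # Reached only if action() returned without raising.
    flag.write_text("ok\n")
    return True
```

Rerunning the pipeline after a crash then skips every step whose flag already exists and picks up at the first one that is missing.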

If you want a deeper level of sophistication, however, you'll want to push task tracking to a database. This will give you the ability to model and track tasks in a way that's much cleaner and easier to manipulate than writing static files to a filesystem. That's when you'll find Python's available libraries for integrating with databases to be a lifesaver. I tried it with Perl but it became really painful and I bit the bullet and learned Python after avoiding it for many years.
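As a minimal sketch of what database-backed tracking could look like, here is a version using SQLite through Python's standard `sqlite3` module. The table layout, status values, and function names are hypothetical and are not the schema of the pipeline described above.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) the tracking database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " sample TEXT NOT NULL,"
        " step TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def add_task(conn, sample, step):
    """Register a new pending task and return its id."""
    cur = conn.execute(
        "INSERT INTO tasks (sample, step) VALUES (?, ?)", (sample, step)
    )
    conn.commit()
    return cur.lastrowid


def set_status(conn, task_id, status):
    """Move a task to a new status, e.g. 'running', 'done' or 'failed'."""
    conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))
    conn.commit()


def tasks_with_status(conn, status):
    """List (sample, step) pairs currently in the given status."""
    rows = conn.execute(
        "SELECT sample, step FROM tasks WHERE status = ? ORDER BY id", (status,)
    )
    return rows.fetchall()
```

Questions like "which samples are still waiting on alignment?" become a single query instead of a walk over the filesystem.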

3
8.4 years ago

There's never going to be any one "best" language. Example previous similar posts would include:

They aren't all specifically on pipelines, but will generally at least make reference to them.

3
8.4 years ago

I'd avoid using Python for building pipelines, as you need to be able to construct and manage threads to reliably pass data from one process to another without I/O blocking. It's doable, but the Python equivalent of a bash one-liner can quickly become many tens to hundreds of lines of code that is more difficult to troubleshoot and maintain.

I would, however, use Python as a component within a pipeline, as it provides a lot of standard data containers and useful native and third-party frameworks for manipulating data input in various formats.

Contrariwise, I would use bash (or a shell of your choice) within a makefile for building pipelines, as it embodies the UNIX principle of stringing small tasks together to do big things, and shells offer a number of convenient built-ins to make pipelines expressive and self-documenting.

On the same note, while bash/make is great for pipeline building, I would not use bash for anything more than simple components within a pipeline as it is difficult to store data in anything more complex than a 1D list.

I find Perl is a good "middle-ground" between Python and bash/make approaches, in that it allows easy shell integration and ridiculously easy data structure construction. For instance, if you want to make a hash table in an array in a hash table, you can do it without first defining the structure:

$fooRef->{someKey}->[1234]->{anotherKey} = "bar";


In Python, you'll have to initialize the structures at each level. On the other hand, while this sort of thing makes writing Perl scripts faster, it can make those scripts more difficult to maintain. Six of one, half dozen of another.

4

ok you recommend pure Perl over pure Python for pipelines because it offers auto-vivification of hashes? I need a beer.

0

If you want to write a quick and dirty script that you'll rarely need to go back to, or start processes without hundreds of lines of code, Perl is a real time-saver. If you need to make lots of tweaks or want to learn threading, write a class in Python and subclass as needed. :)

2

The statement about needing to initialize each level of the data structure in Python is actually incorrect. There are two ways to avoid having to do this, depending on your access methods. If, for instance, you don't expect to ever access something by a key or list position that doesn't exist, you can simply not worry about it at all. If, on the other hand, you want to be robust, this is where things like Python's `collections.defaultdict` come into play.
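A small sketch of the `defaultdict` route, using a recursive factory (the name `tree` is just a label chosen here). Note that every level is a dict, so `1234` is a dictionary key here and not an array index as in the Perl line.

```python
from collections import defaultdict


def tree():
    """A dict whose missing keys spring into existence as further trees."""
    return defaultdict(tree)


foo = tree()
# No level has to be created by hand before the assignment.
foo["someKey"][1234]["anotherKey"] = "bar"
```

One caveat: merely reading a missing key (`foo["typo"]`) also creates it, so test membership with `in` when you only want to look.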

3
8.4 years ago

The suitability of different languages always depends on the tasks. The general purpose programming languages can be used to implement individual steps of an analysis, just as Alex suggested. The orchestration of these steps, i.e. the execution and the management of the pipeline, is better left to a real workflow engine, of which there are many. For the sake of modularity and the associated benefits, it is good to keep tasks separate from their hosting pipelines.

My personal favorite has been Anduril (partially because we've made it). Some steps of an NGS pipeline are often computationally demanding, so I reckon parallelization and fault tolerance are highly appreciated. Although you may need a reasonable amount of disk space for your analysis, I still find it handy that all intermediate results are well organized and accessible to an end user. The development of new Anduril pipelines and components is eased by the extensive support for testing and a large set of invariants such as syntax, structure, and type checks of pipelines and their elements.

I would recommend Anduril for those willing to cover everything from the data collection to the polished representation of end results in their pipelines. The prerequisites include some programming skills, and the target audience consists of those developing the pipelines and components. Anduril does not have a real GUI but comes with an expressive scripting language, which enables construction and management of very complex pipelines (possibly thousands of components with a vast number of conditional properties and dependencies). Anduril is not necessarily the easiest to learn, but its flexible nature can be adjusted nicely to almost any batch-like process. You may consider other frameworks if you are interested in streaming data between subsequent components or if you need a continuous and iterative process, where the upstream is fed with the downstream results.

2
8.4 years ago
Prakki Rama ★ 2.6k

Never regretted learning Perl.

1
8.4 years ago
alistairnward ▴ 210

As already mentioned, the language depends on the analysis that you are attempting to perform. There are a number of projects out there that attempt to provide pipelines for you. For example, see the gkno project (gkno.me) for a set of tools and pipelines. Developing pipelines can be achieved by creating or modifying configuration files.

1
8.4 years ago

Although your complaints about R can be valid, I think it can be quite helpful for pipelines because of Bioconductor. I find this helps spread the word about a pipeline (I personally get more downloads for the same tool when hosted on Bioconductor versus SourceForge), and it makes dealing with dependencies a lot easier because they are automatically installed with your package.

Plus, the code doesn't have to be done entirely in R - so far, every Bioconductor package that I have written has some portion written in Perl (and I think it is also fairly common to have portions of Bioconductor packages written in C++, etc.)

1
8.4 years ago
Christian ★ 3.0k

For developing pipelines, I strongly recommend using GNU Make or something equivalent, because it takes care of a lot of pipeline-related issues that otherwise need to be implemented from scratch (parallelization, incremental builds after modification/errors, error status checking, etc.). Within the Make framework, you can throw in scripts in whatever programming language you prefer (Perl, Python, R, ...).

This is a great place to get started: http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
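To make "incremental builds" concrete, the rule Make applies when deciding whether to rerun a step can be sketched as below. This is an illustration of the idea in Python, not how Make is implemented, and `needs_rebuild` is a name invented for the example.

```python
import os


def needs_rebuild(target, prerequisites):
    """True if target is missing or older than any of its prerequisites."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

With Make you get this check for free on every rule, which is exactly the kind of logic that would otherwise have to be written from scratch.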

1
8.4 years ago
pld 5.0k

Whichever language you're most comfortable with and whichever language has the features, performance or other facets that best fit your needs. If memory consumption is a concern, move to a language that offers more control over memory usage (C, C++). If flexibility is an issue, move to Python or Perl. If you're trying to win nerd points, write it in Haskell. If you want to add parallelization, consider which languages have wrappers to whichever MPI flavor you're thinking of using.

IMO, anything but R, especially if you desire improved memory consumption and performance. Python and Perl are great for management of data (moving, formatting, simple processing), but for more intensive applications I think it is worth spending the time and writing those parts in C or C++. A pipeline is an investment; spend the time now to make it worth your while.

1
8.4 years ago

I'll add Java since nobody has mentioned it yet :) It is fast and cross-platform. You can easily develop even complex applications with it, as the language is quite strict and object-oriented, so no problems with debugging and unit-testing. There are many scientific libraries available for it, including ones for NGS data analysis, e.g. the Picard API. It allows flexible dependency management with tools such as Maven. You can then call Java libraries with ease from scripting languages, such as Groovy and JavaScript, which will play the role of Perl/Python in your pipeline. And don't forget to learn how to use a version control system (VCS) such as Git, hosted for example on Bitbucket; this is also really important for providing stable software.

1
8.4 years ago
Björn ▴ 670

If reproducibility, transparency and accessibility are key factors of your pipeline, I would recommend the Galaxy API for pipeline development. Maybe that sounds strange, but Galaxy provides many different abstraction layers that will save you hours of work.

For example, do you care about tool versions and reproducibility? The Galaxy Tool Shed is the answer, and many tools are already integrated by a very nice community.

Do you care about cluster integration? Galaxy supports many different cluster setups, including AWS and Open Cloud.

Is visualization one of your end-points? Galaxy offers integration with UCSC, IGV … and supports built-in visualizations in your browser.

All of that can be controlled by the Galaxy REST API, and that is independent of your programming language. You can use Perl, Java, Python or, if you like, pure curl statements from your bash. Whatever you are using, it will be tracked inside of Galaxy, which will ensure reproducibility and transparency over time. For me that is one of the key factors in pipelining.