Forum: Which programming language to integrate several open source software into one working pipeline (software)
5.9 years ago by
benjamin.hebisch70 wrote:

Hey there!


I'm completely new to the programming/coding side of bioinformatics, although I have been using phylogenetic software quite successfully for 2 years.


I'm planning a PhD project in which several open access tools are combined into one working pipeline/workflow/software (similar to MEGA5) to run analyses overnight, or over a week or month. Done manually, this workflow cost me 1 year to analyze a bunch of proteins and to get familiar with several tools. In the long run, I want to implement this for an automatic sub-proteome analysis.


Is it possible, in any language, to program a tool that can obtain data from a certain database, submit it to software A, control software A, send the output of software A to software B, control software B, and so on?

The general workflow is mainly linear or has just one branching point.




mega pipeline forum • 2.6k views

Wow! Many thanks!


I will dig through those programs. Especially those mentioned by Chris Evelo seem to be so user friendly that even wet-lab scientists could work with that :D

Reply written 5.9 years ago (modified 5.9 years ago) by benjamin.hebisch70
5.9 years ago by
Ljubljana, Slovenia
Cytosine450 wrote:

You can do this with just about any programming language. Pick the one you're most comfortable with.

5.9 years ago by
Chris Evelo10k
Maastricht, The Netherlands
Chris Evelo10k wrote:

Did you look at Taverna, Knime, or Bioclipse? You might not want to start a whole new project for something that has been done before, and that is even open source, so it could be extended if it doesn't fit your needs completely.

Update. Let me explain a bit more.

People mentioned that basically every programming language or OS can do this, and they are right. Make tools can make it easier for you, since they were made to steer workflows; that is of course true as well. Still, some tools are simply better than others at solving specific problems, or let you use the parallelisation capacities of your system. So many people have a favorite, often for good reasons, and there are a lot of correct answers here.


I have seen many instances where using output from one program as input for another was not simple at all. You need file format changes, changes in the database identifiers used, ontology mappings (for instance, mappings between the information in study descriptions about tissue function and cell types, to find the studies that can be compared) and conversions from one standardized (?) format to another. Many questions here on Biostar are about such individual steps. The recent question about using pathways in BioPAX was a nice example of how quickly that can become complicated. In practice you often need a lot of blocks, and you need *glue* in between the blocks: things that take, say, BioPAX produced by one tool and produce the SBGN needed by another. Creating these kinds of converters to glue things together can take months of work, and often they still don't cover the fact that real data also contains format errors. So there is a big advantage to having a toolkit full of blocks, connections between blocks, and tools that let you configure those connections.
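As a small illustration of such glue, here is a hedged sketch of an identifier converter in bash and awk. The function name translate_ids and the two-column mapping layout are assumptions made for this example, not part of any tool named above. It rewrites the ID in column 1 and reports malformed or unmapped lines on stderr instead of passing them on silently:

```bash
# translate_ids MAPPING INPUT
# MAPPING: two tab-separated columns, old ID then new ID (assumed layout).
# INPUT:   tab-separated, with the ID to translate in column 1.
translate_ids() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { map[$1] = $2; next }    # first file: load the mapping
        NF < 2 || !($1 in map) {            # malformed line or unknown ID
            print "skipped line " FNR ": " $0 > "/dev/stderr"
            next
        }
        { $1 = map[$1]; print }             # known ID: rewrite column 1
    ' "$1" "$2"
}
```

If the mapping file is empty, the NR == FNR test can no longer tell the two files apart, so check for that before calling it.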

It is unfortunately not true that those workflow tools are very simple to use. First of all, you need to know what you are doing; in that respect it helps to understand both the tool and the biology. And your specific problem will often still contain some really new steps, which you will have to code. Reusability comes at a price too. You need to document even better than you should anyhow, and ideally you would think more about what the next user might encounter. These workflow environments are in part built to force you to do that, which sometimes makes them harder to use than you would expect. But yes, if you collaborate with a wet-lab group, those kinds of tools will be easier for them to use if you supply the patches for their specific problem.

The good thing really is in the reusability and thus in the sharing of solutions and building blocks. That is what a site like is for.



5.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

I would use GNU Make.


What makes it special?

Reply written 5.9 years ago by Medhat8.7k

make is used by millions of software projects to build resources along a dependency tree. It will automatically run whichever steps are needed for your defined endpoint, after you write the Makefile.

Reply written 5.9 years ago by karl.stamm3.6k
5.9 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

Use makefiles and GNU make. There's a nice overview of using make to build and control analysis pipelines at Bioinformatics Zen.

Why use makefiles? In brief, just about anything you can run from the command line can be called from a makefile, and if you need to rerun a pipeline, only changed intermediate files trigger rebuilding of related targets (unless you force otherwise). This can reduce the time required to rerun a pipeline, and a consistent build path also reduces the odds of user errors.
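The rebuild rule can be illustrated without make at all. The following is only a bash sketch of the freshness check, not how GNU make is implemented; needs_rebuild and step are hypothetical names, and uppercasing with tr stands in for a real analysis tool:

```bash
# needs_rebuild TARGET SOURCE
# Succeeds when TARGET is missing or older than SOURCE.
needs_rebuild() {
    [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

# step SOURCE TARGET
# Hypothetical pipeline step: regenerate TARGET only when it is out of date.
step() {
    if needs_rebuild "$2" "$1"; then
        tr 'a-z' 'A-Z' < "$1" > "$2"
        echo "rebuilt $2"
    else
        echo "up to date: $2"
    fi
}
```

Calling step in.txt out.txt twice in a row does the work once; touching in.txt afterwards triggers a rebuild on the next call. make applies the same comparison to every target in the dependency tree.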

Further, you need very little customization to run makefiles via GNU make, which is a toolkit already on most OSS-based systems found in bioinformatics, and make is agnostic about specialized scripting tools. It doesn't care if you use Python, Java, Perl, bash, etc., and it is robust: it doesn't share their uniquely weird and fragile version and library dependencies. Things aren't going to break if you update a minor Python version, for instance. (Well, a Python script in your pipeline might break, but that's a separate issue.)

5.9 years ago by
State College, PA, USA
Biomonika (Noolean)3.1k wrote:

Galaxy is very popular and convenient for connecting outputs from multiple programs/scripts, e.g. by creating workflows. Look for images when you google "workflow galaxy".

5.9 years ago by
5heikki8.9k wrote:

I do this kind of thing daily with Bash. Big fanboy! Want to do something to every file in a folder?


for f in ../*.fasta
do
    # strip the directory and the .fasta suffix to name the output file
    name=$(basename "$f" .fasta)
    blastn -query "$f" -db someDB -out "$name.output"
done



Want to do something and then something else and then something else? Pipes everywhere!

blastn -query file.fasta -db someDB -outfmt 6 | cut -f1 | sort -u | grep whatelse
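A minimal sketch of the same chaining idea for whole pipeline steps rather than single pipes, assuming each step is wrapped in a shell function. The names fetch_data, software_a, software_b and run_pipeline are hypothetical stand-ins, not real tools; the bodies only shuffle text so the sketch runs anywhere. Joining the steps with && stops the run at the first step that fails:

```bash
# Hypothetical stand-ins; replace each body with the real command.
fetch_data() { printf '>seq1\nacgt\n' > "$1"; }     # "obtain data from a database"
software_a() { tr 'acgt' 'ACGT' < "$1" > "$2"; }    # first tool: input -> output
software_b() { grep -c '^>' "$1" > "$2"; }          # second tool: counts FASTA headers

# run_pipeline WORKDIR
# Runs the steps in order; each step starts only if the previous one succeeded.
run_pipeline() {
    workdir=$1
    fetch_data "$workdir/input.fasta" &&
    software_a "$workdir/input.fasta" "$workdir/a.out" &&
    software_b "$workdir/a.out" "$workdir/b.out" ||
    { echo "pipeline failed in $workdir" >&2; return 1; }
}
```

Calling run_pipeline on an existing directory leaves input.fasta, a.out and b.out there; a non-zero return status means one of the steps failed.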


If then, else if then, else?


while IFS= read -r line
do
    chromosome=$(echo "$line" | cut -f1)
    if [ "$chromosome" = "chrX" ]
    then
        someNumber=$(echo "$line" | cut -f8)
    elif [ "$chromosome" = "chrY" ]
    then
        someNumber=$(echo "$line" | cut -f9)
    else
        echo "NA"
    fi
done




Pros of Bash? 99% certain you're already using a Bash shell. Builtin tools are perfect for scripting and you'd be using them anyway for e.g. processing tab separated values. Syntax.

Cons? It's very slow in comparison to other languages, although this doesn't matter if you're just piping bits from one program into another. It's the execution time of the programs that matters. Syntax.


Syntax is very simple, but it's easy to make mistakes, e.g. when the difference between echo $line and echo "$line" matters.
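To make the quoting pitfall concrete, a short sketch with a made-up tab-separated line whose second field is empty:

```bash
# A tab-separated line with an empty second field (made-up example data).
line=$(printf 'chrX\t\t42')

# Unquoted: the shell splits the value on whitespace, so echo receives two
# words and prints them joined by a single space. The tabs are lost.
unquoted=$(echo $line)

# Quoted: the value reaches echo unchanged, tabs included.
quoted=$(echo "$line")

printf 'unquoted third field: %s\n' "$(echo $line | cut -f3)"
printf 'quoted third field:   %s\n' "$(echo "$line" | cut -f3)"
```

Run in bash, the unquoted version prints chrX 42 (the tabs are gone, so cut finds no delimiter and returns the whole line), while the quoted version prints 42.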

but to get faster results on a multicore machine use 'make -j N' or 'gnu-parallel'


Reply written 5.9 years ago by Pierre Lindenbaum129k
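If neither make -j N nor GNU parallel is at hand, plain shell job control gives a rough equivalent. This is a swapped-in technique, not what either tool does internally; process_one and run_in_batches are hypothetical names, and tr stands in for a real per-file tool. It starts batch_size jobs in the background, waits for the batch to finish, then starts the next:

```bash
# process_one FILE
# Hypothetical per-file step: writes an uppercased copy next to the input.
process_one() {
    tr 'acgt' 'ACGT' < "$1" > "${1%.fasta}.upper"
}

# run_in_batches BATCH_SIZE FILE...
# Runs process_one on every file, at most BATCH_SIZE jobs at a time.
run_in_batches() {
    batch_size=$1
    shift
    running=0
    for f in "$@"; do
        process_one "$f" &              # start the job in the background
        running=$((running + 1))
        if [ "$running" -ge "$batch_size" ]; then
            wait                        # block until the current batch is done
            running=0
        fi
    done
    wait                                # collect the last, possibly partial batch
}
```

run_in_batches 4 ../*.fasta would process four files at a time. Unlike make -j, a slow job holds up its whole batch.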
5.9 years ago by
United Kingdom
smithtomsean180 wrote:

We use ruffus here to build pipelines in Python. It works very nicely for me, with a sensible system for checking file dependencies, i.e. if you change your pipeline halfway through or change an intermediate file, it will work out automatically where to restart the analysis from.

I'd echo what Cytosine says though and start from the programming language you're comfortable with.
