Question: How to decide which software to use for building an automated NGS analysis pipeline
Mcmahanl • 300 wrote, 9.5 years ago:

There is so much software available for building automated NGS analysis pipelines; how does one decide which to use? For example, here are some of the packages I have come across:

  1. CloVR

  2. The ngs analysis pipeline part of the NGS information management and analysis system for Galaxy

  3. Nextgen in BioPerl (BioPerl NGS new features mentioned in an ISMB 2010 poster, and BioPerl generic wrappers for external programs)

  4. Taverna workflow management system (webinar video, eGalaxy)

  5. PaPy: A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines

  6. Pipeline manager feature of the BioHPC

  7. SVMerge pipeline for integrating variant callers (also Somatic SNiPer pipeline)

  8. Integrated services for genomic analysis (ISGA)

  9. Cyrille2 pipeline system

  10. Conveyor: a workflow engine for bioinformatic analyses

  11. Pipeline section in Oxford Journals (Bioinformatics) list of key papers relating to all aspects of NGS

And any other software that I should look into?

next-gen pipeline sequencing • 9.3k views
modified 8.2 years ago by Jeremy Leipzig • 19k • written 9.5 years ago by Mcmahanl • 300

Sometimes the solution is to be aware of only one of these ;-). Now, in all seriousness, thanks for listing all these examples; this will be a great resource.

written 9.5 years ago by Istvan Albert ♦♦ • 84k

Check out

written 3.6 years ago by egafni • 30
Farhat • 2.9k (Pune, India) wrote, 9.5 years ago:

I have experience with shell-script-based pipelines and Galaxy. While Galaxy provides a great front end for building pipelines, I have found it slower at running tasks. One serious drawback of Galaxy is that it stores the results of every intermediate step in their full uncompressed glory. This, I am sure, partly accounts for the slowdown, since the disk-writing activity is heavy. It also fills up drives, which can be an issue in itself, especially if you are doing a lot of analyses.

Shell scripts can be really flexible and powerful, but they are not as user-friendly, although I am sure any kind of scripting language could deliver similar results.
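One way shell pipelines sidestep the uncompressed-intermediates problem is to keep intermediates gzipped and stream between steps. A minimal sketch with toy stand-in data (gzip/gunzip and wc in place of real NGS tools; file names are illustrative):

```shell
set -eu
# toy reads file standing in for a FASTQ
printf 'read1\nread2\nread3\n' > reads.txt
# keep the intermediate compressed instead of writing it out in full
gzip -c reads.txt > reads.txt.gz
# stream straight from the compressed copy into the next step, so no
# uncompressed intermediate ever touches the disk
gunzip -c reads.txt.gz | wc -l > read_count.txt
```

With real tools the same pattern applies, e.g. piping an aligner's output directly into a compressing consumer rather than materializing intermediate files.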

written 9.5 years ago by Farhat • 2.9k
Ryan Dale • 4.9k (Bethesda, MD) wrote, 9.5 years ago:

I write many of my NGS pipelines using Ruffus. It's really easy to run tasks in parallel. Simple pipelines are correspondingly simple to write, but at the same time it's rich enough to support very complex pipelines, too.
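Ruffus itself supplies decorators such as @transform together with pipeline_run; as a rough, library-free illustration of the style (the decorator and file names below are my own, not the Ruffus API), the core idea is to declare each task against file patterns and skip any task whose output is already up to date:

```python
import os

def transform(inputs, out_ext):
    """Run the task once per input file, make-style: skip outputs newer than inputs."""
    def wrap(task):
        def run():
            outputs = []
            for src in inputs:
                dst = os.path.splitext(src)[0] + out_ext
                # only re-run the task if the output is missing or stale
                if not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src):
                    task(src, dst)
                outputs.append(dst)
            return outputs
        return run
    return wrap

with open("a.fastq", "w") as f:        # toy stand-in for real read data
    f.write("reads\n")

@transform(["a.fastq"], ".sam")
def align(src, dst):                   # stand-in for a real aligner call
    with open(src) as fin, open(dst, "w") as fout:
        fout.write("aligned:" + fin.read())

print(align())  # ['a.sam']; calling it again skips the up-to-date output
```

The real library adds parallel execution, dependency chaining between decorated tasks, and flowchart output on top of this pattern.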

written 9.5 years ago by Ryan Dale • 4.9k
lh3 • 32k (United States) wrote, 9.5 years ago:

From what I have heard, Galaxy is the most widely used generic pipeline for NGS. Nonetheless, I guess more people are building their own pipelines from scratch. IMHO, the difficulty of using a generic pipeline comes from the differences between parallelization environments. It is pretty easy if everything runs on the same node, but LSF/SGE/PBS and their different configurations (e.g. memory and runtime limits) make things messy.

If you are the only user of your cluster and have full control, using a generic pipeline may not be hard. A friend of mine built a private cloud and uses Galaxy; everything runs smoothly. If you are using nodes that are part of a huge cluster, writing your own pipeline is probably easier. When you control your pipeline, you can also easily avoid the inefficiencies Farhat mentioned. Implementing an initial pipeline is not that difficult; it will take time to polish it, but the same is true if you use a generic pipeline framework.
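To make the scheduler-difference point concrete, here is a hedged sketch of one way custom pipelines paper over it: a single submit wrapper with a case per batch system. The flags shown are illustrative, not tuned configurations, and real sites usually need site-specific queue and resource options on top:

```shell
# submit one command via whatever scheduler this site uses
submit() {
    case "${SCHEDULER:-local}" in
        lsf)   bsub -M 4000 "$1" ;;               # LSF: memory limit flag
        sge)   qsub -b y -l h_vmem=4G "$1" ;;     # SGE: binary job + vmem request
        pbs)   echo "$1" | qsub -l mem=4gb ;;     # PBS: job script on stdin
        local) sh -c "$1" ;;                      # no scheduler: run in place
    esac
}
# with SCHEDULER unset, this just runs locally
submit "echo aligned > out.txt"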

written 9.5 years ago by lh3 • 32k
Ketil • 4.0k wrote, 9.5 years ago:

I don't know what you mean by "NGS analysis", but coming from a comp. sci. background, I tend to use make to construct non-trivial pipelines. For our current de novo project, the pipeline consists of primary assembly (Newbler, Celera and CLC), secondary assembly (SSPACE), remapping of reads (bwa index, aln, and sampe), quality evaluation (samtools idxstats and flagstat), generating graphs (gnuplot), and so on.

Since many of these steps are time-consuming, and since something always fails at some point, make's ability to skip already-completed files saves a lot of time, especially with some careful sprinkling of .PRECIOUS. Also, make's -j option means that my pipeline is trivially parallelized.

The downside is that it can be a bit hard to debug, but -r --warn-undefined-variables helps a bit. I'm still missing some way of separating the output of subprocesses, especially when running -j 16 :-)
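A toy Makefile in the same spirit might look like the following, with cp and wc standing in for the real aligner and samtools flagstat. The GNU-make-only .RECIPEPREFIX setting is used here only to avoid literal tab characters in this sketch:

```shell
cat > Makefile.demo <<'EOF'
.RECIPEPREFIX := >
# .PRECIOUS keeps the expensive intermediate if a later step dies
.PRECIOUS: %.bam

all: sample.flagstat

%.bam: %.fastq
> cp $< $@
%.flagstat: %.bam
> wc -l < $< > $@
EOF

printf 'r1\nr2\n' > sample.fastq
make -f Makefile.demo -j 2 all   # -j runs independent targets in parallel
make -f Makefile.demo all        # re-run: up-to-date targets are skipped
```

The second invocation is where make earns its keep: after a mid-pipeline failure, only the missing or stale targets get rebuilt.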

written 9.5 years ago by Ketil • 4.0k

I've always wondered how people use make this way. Do you have validator scripts that can distinguish when an output file is not just garbage?

written 9.5 years ago by Jeremy Leipzig • 19k

Most of my output files are garbage. Maybe I should switch to Haskell.

written 9.5 years ago by Jeremy Leipzig • 19k

Why would an output file be garbage? If anything, make helps against this, since it deletes temporary files when something goes wrong.

But yes, validating results is always a good idea.

written 9.5 years ago by Ketil • 4.0k

:-) Yes! Unfortunately, the old rule of garbage in, garbage out is universal, and independent of implementation language...

written 9.5 years ago by Ketil • 4.0k
Ben Lange • 190 (Minneapolis, MN) wrote, 9.5 years ago:

I have extensive experience with a large custom-developed pipeline that uses a database to coordinate tasks across a private pool of commodity PCs. This approach is very flexible but exposes lots of coordination complexity as workflows grow.

It's pretty clear that the more forking and joining you have in your workflow, the higher the complexity regardless of your approach.

Focus on what costs the most. When you're dealing with large amounts of data, storage is cheap but accessing and moving it is not, so the more localized the data is to the compute nodes, the higher the throughput.
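A minimal sketch of the database-as-coordinator idea described above (the schema and worker names are illustrative; a real pool would share a server database rather than in-memory SQLite, and workers would run on separate machines):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, cmd TEXT, "
             "claimed_by TEXT DEFAULT NULL)")
conn.executemany("INSERT INTO tasks (cmd) VALUES (?)",
                 [("align sample1",), ("align sample2",)])

def claim_task(worker):
    """Return one unclaimed (id, cmd) task for this worker, or None.

    The conditional UPDATE makes the claim safe: if two workers pick the
    same row, only one UPDATE matches and the loser simply retries.
    """
    row = conn.execute(
        "SELECT id, cmd FROM tasks WHERE claimed_by IS NULL LIMIT 1").fetchone()
    if row is None:
        return None
    claimed = conn.execute(
        "UPDATE tasks SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
        (worker, row[0])).rowcount
    return row if claimed else claim_task(worker)  # lost the race: try again

print(claim_task("node-a"))  # (1, 'align sample1')
print(claim_task("node-b"))  # (2, 'align sample2')
print(claim_task("node-c"))  # None
```

Each worker would then fetch its input data locally before running the command, which is where the data-locality cost mentioned above dominates.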

written 9.5 years ago by Ben Lange • 190
Jeremy Leipzig • 19k (Philadelphia, PA) wrote, 8.2 years ago:

We have been building some genotyping pipelines in Pegasus, which is a very heavyweight platform for scientific pipelines and is apparently NSF-funded through the 2016 Olympics in Rio. Plan accordingly.

Pegasus is very friendly with Condor, although it can be run on other batch systems with some headaches.

The nodes look like this (angle brackets, which had to be stripped to conform to BioStar, are restored here):

```xml
<job id="ADDRG_01" namespace="align" name="java" version="4.0">
  <argument>-jar ${picardfolder}/AddOrReplaceReadGroups.jar TMP_DIR=${picardtemp}</argument>
  <stdout name="$(unknown).picardrg.out" link="output"/>
  <stderr name="$(unknown).picardrg.err" link="output"/>
</job>
```
modified 8.2 years ago • written 8.2 years ago by Jeremy Leipzig • 19k

Sorry about that ;-) We need to fix XML display pronto; I'll make this a priority.

written 8.2 years ago by Istvan Albert ♦♦ • 84k

Also, the main reason this has not been done so far is that I don't think I understand all the implications of properly escaping HTML, nor the conditions in which it should or shouldn't happen. Plus, the escaping needs to interact with the prettifier, which is also not obvious how to do. Thus I am afraid that I would open a JavaScript-injection security hole with it.

written 8.2 years ago by Istvan Albert ♦♦ • 84k

OK, thanks; whatever helps people post more code is always good.

written 8.2 years ago by Jeremy Leipzig • 19k