What's The Best Generic Scripting Tool For Bioinformatics?
4
5
10.5 years ago
Mike Sanders ▴ 220

As a bioinformatics team grows, there is a huge benefit in sharing code, particularly with strong coding guidelines that promote readability and reuse.

But can we really pick a common technology and infrastructure to share with our colleagues? Generic scripting languages like Perl, Python and Groovy are great, and each comes with different strengths for common file and data marshalling or common mathematical/statistical data manipulation.

Other, less generic tools like R, KNIME and Galaxy take us deeper into data analysis without becoming specific to one scientific application. And often the tools specific to an instrument are more efficient for targeted work. What is a bioinformatics team to do?

Thoughts?

software • 6.9k views
11
10.5 years ago

Use the best tool for the job? Have people around who know Perl or Python and R, are very efficient under Linux and can learn new tools fast. The problems found in bioinformatics are very diverse, and you won't come close to anything that could be regarded as 'the best generic scripting tool'. Cheers

7
10.5 years ago
Ketil 4.1k

I like to split bioinformatics in two: the development of algorithms and tools that perform some specific analysis, and scripting the various tools together for some specific project. Tools usually require high performance, and are consequently often written in C, sometimes in C++, occasionally Java. Scripting requires rapid development, and is typically done in Python or Perl.

This distinction is not a hard and fast line, so I think an ideal scripting language should support vertical integration: it should offer both rapid development and high performance, and thus be usable across the whole chain. I, of course, like to see Haskell as covering this niche, and thus ideal for bioinformatics, but I guess it's still somewhat esoteric. It is more likely that the current situation will evolve by incorporating tools as libraries in the scripting languages, much like Matlab is dog slow but uses algorithms from well-tuned C libraries to do the heavy lifting.

Visualization of results might constitute a third category. I tend to pipe output data to gnuplot or dot, but sometimes you need something more interactive or flexible. The important criterion is then likely to be the available libraries.
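As a rough illustration of the "pipe output data to dot" workflow, here is a minimal Python sketch that writes a Graphviz DOT graph to standard output. The `to_dot` helper, the step names and the script file name are all made up for this example; only the DOT syntax and the `dot` command itself come from Graphviz.

```python
import sys

def to_dot(edges, name="pipeline"):
    """Render (source, target) pairs as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical pipeline steps, for illustration only
    steps = [("reads", "aligner"), ("aligner", "variant_caller")]
    sys.stdout.write(to_dot(steps) + "\n")
```

Saved under a hypothetical name such as `make_graph.py`, it could be used as `python make_graph.py | dot -Tpng -o pipeline.png`. The script only emits text, so the rendering stays in the external tool.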

4

I think that trifecta is missing biology and statistics.

2

Agreed. The trifecta of bioinformatics: tools, shell/Linux knowledge, and Python/Perl/other.

0

Excellent, that is consistent with what I have seen: instrument vendor tools, followed by common server tools, and some pretty strong sharing in Perl/Python. In less academic environments there are also some commercial tools, for example for generic analysis pipelining.

0

I would say LuaJIT offers the best combination of rapid development and high performance. You can write much shorter code while achieving an efficiency close to C, for some tasks.

0

Python, with the ability to compile to C code with the Cython package (the speed-up mainly requires adding type declarations to variables), seems like a very good choice for high productivity with optional high performance.

4
10.5 years ago

I think Perl combined with the BioPerl toolkit is still very relevant. Python and Biopython are picking up a lot of momentum.

1

BioPerl continues to be loved! Yet I do like the readability of some of the other tools. Thanks for your help!

3
10.5 years ago

Many well-respected bioinformatics labs are choosing R. The following institutions have individuals who are deeply dedicated to developing R packages (often with C and C++ inside) for use in bioinformatics scripting:

• Fred Hutchinson Cancer Research Center
• Harvard Medical School
• National Cancer Institute
• European Molecular Biology Laboratory

Here is why I think R is so good for scripting in bioinformatics:

1. Operators and functions are vectorized
   • For example, + in Perl or Python will only add two individual values. In R, c(1,2,3) + 10 produces 11 12 13, which makes more sense in a scientific setting. There is no need to write for loops, which are super slow in a scripting language. In R the vector operation happens at the C level, so it will zoom through 10 million elements in an instant.
2. Abundance of high-quality statistical and analytics libraries
   • CRAN has 2983 packages
   • Bioconductor has 460+ packages
3. Documentation is superb
   • Obligatory for every function in every package
   • Examples are plenty and relevant
4. It's easy for biologists to use (especially now that there is RStudio)
   • Run commands interactively
   • Produce superb graphs with ggplot2
   • Export results as a spreadsheet
5. Nothing is a black box
   • Type the function name without the () and you will see the source code
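
To make the contrast with R's c(1,2,3) + 10 concrete, here is a small sketch of what the same expression does with a plain Python list. It uses only built-in lists, and the variable names are just for illustration.

```python
values = [1, 2, 3]

# Plain "+" between a list and a number is an error in Python
try:
    values + 10
    scalar_add_works = True
except TypeError:
    scalar_add_works = False

# The element-wise result needs an explicit loop or comprehension
shifted = [v + 10 for v in values]

print(scalar_add_works, shifted)  # False [11, 12, 13]
```

Python users who want the R-like behaviour usually reach for NumPy arrays, which do broadcast a scalar over every element; that is an add-on library rather than part of the core language.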
2

Even though I use R often, I disagree. R is by far one of the worst computer languages I've ever seen, and it's also extremely inefficient. Building a language like R would fail you in any computer 101 course. In my experience it is the other way around: I'm seeing fewer people using R. It was used a lot when microarray experiments were popular, but not so much for sequencing analysis.

1

@Pablo, you mentioned that you write code in R. Could you provide some links to your code (e.g. on GitHub)? I want to see how you are using R; that would perhaps explain why you hate it so much.

1

When your tasks can be organized into numerical vectors, R is not too bad, but otherwise R is easily the most inefficient scripting language I have ever used, frequently ~10X slower than Python/Perl. In addition to doing statistics, we also process text files and prototype non-numerical algorithms. R fails on both and probably more. If one only has time to learn one scripting language, Python is by far a better choice. Its numerical package NumPy is also decent, I believe (I do not use Python).
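
For readers wondering what "processing text files" looks like in practice, here is a minimal Python sketch. The choice of FASTA as the format, the `read_fasta` and `gc_fraction` helpers and the two records are assumptions made for illustration and do not come from the comment above.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

# Made-up records, for illustration only
example = io.StringIO(">seq1\nACGT\nGGCC\n>seq2\nATAT\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq), gc_fraction(seq))
```

The parser works on any iterable of lines, so the same function would accept an open file instead of the in-memory `StringIO` used here.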

1

R is very different. If you are a superb Python, Ruby, or Perl programmer, you are very likely to initially write R code ~10X less efficient in speed, readability, and # of lines. You need to avoid for-loops, use vector operations, do data transformations, use logical vectors. Those are some things that make R a very foreign language.

1

Statistical capabilities are certainly an R strength, also its 'high level' abstraction (fit a linear model, rather than calculate sums of squares) and facilities (via packages, vignettes, versions,...) for (more) reproducible research. Volume of data, technical artifacts, experimental design, etc all make our work highly statistical; higher-level reasoning means taking bigger intellectual strides with each line of code; inherent collaboration, especially between folks with disparate areas of expertise, puts a premium on reproducibility.

0

PS: I am not saying R is useless. R is very powerful for statistics and plotting. I just think R is not a proper scripting language for general purposes.

0

Many things cannot be organized into vectors. The significant inefficiency of R is largely due to its crappy implementation. If some capable programmers implemented it from scratch again, it should have similar efficiency to perl/python. Nonetheless, capable programmers would rather choose a language with a decent design instead of R.