Question: What's The Best Generic Scripting Tool For Bioinformatics?
5
6.6 years ago by
Mike Sanders220
Victoria, BC
Mike Sanders220 wrote:

As a bioinformatics team grows, there is a huge benefit from sharing code, particularly with strong coding guidelines that promote readability and reuse.

But can we really pick a common technology and infrastructure to share with our colleagues? Generic scripting languages like Perl, Python and Groovy are great, and each comes with different strengths for common file and data marshalling or for common mathematical/statistical manipulation of data fields.

Other, less generic tools like R, KNIME and Galaxy take us deeper into data analysis applications without becoming specific to one scientific application. And often it is the tools specific to the instrumentation that are more efficient for targeted work. What is a bioinformatics team to do?

Thoughts?

software • 5.0k views
modified 2.3 years ago by Biostar ♦♦ 20 • written 6.6 years ago by Mike Sanders220
11

Use the best tool for the job? Have people around who know Perl or Python and R, are very efficient under Linux, and can learn to use new tools fast. The problems found within the field of bioinformatics are very diverse, and you won't come close to anything that could be regarded as 'the best generic scripting tool'. Cheers

written 6.6 years ago by Eric Normandeau9.6k

Thanks Eric. Practical and real advice.

written 6.6 years ago by Mike Sanders220
7
6.6 years ago by
Ketil3.8k
Germany
Ketil3.8k wrote:

I like to split bioinformatics in two: the development of algorithms and tools that perform some specific analysis, and scripting the various tools together for some specific project. Tools usually require high performance, and are consequently often written in C, sometimes in C++, occasionally in Java. Scripting requires rapid development, and typically uses Python or Perl.
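To make the "scripting the various tools together" half concrete, here is a minimal Python sketch of chaining two command-line tools. It is only an illustration: the two "tools" are hypothetical stand-ins (tiny Python one-liners launched through the current interpreter), and the names run_step, UPPERCASE, REVERSE and pipeline are made up for this example. In a real project the command lists would name actual programs.

```python
import subprocess
import sys


def run_step(command, input_text):
    """Run one external tool, feed it text on stdin, and return its stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout


# Hypothetical stand-ins for real command-line tools:
# each one is a small Python one-liner run as a separate process.
UPPERCASE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
REVERSE = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().strip()[::-1])"]


def pipeline(sequence):
    """Chain the two tools: the output of the first is the input of the second."""
    upper = run_step(UPPERCASE, sequence)
    return run_step(REVERSE, upper)


if __name__ == "__main__":
    print(pipeline("acgtt"))
```

The script itself does no heavy lifting; it only passes data from one process to the next and stops at the first failing step, which is the role the scripting layer plays in this split.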

This distinction is not a hard and fast line, so I think an ideal scripting language should support vertical integration - it should offer both rapid development and high performance, and thus be possible to use in the whole chain. I, of course, like to see Haskell as covering this niche, and thus ideal for bioinformatics, but I guess it's still somewhat esoteric. It is more likely that the current situation will evolve by incorporating tools as libraries in the scripting languages, much like Matlab is dog slow but uses algorithms from well tuned C libraries to do the heavy lifting.

Visualization of results might constitute a third category. I tend to pipe output data to gnuplot or dot, but sometimes you need something more interactive or flexible. The important criterion is then likely to be the available libraries.

modified 6.6 years ago • written 6.6 years ago by Ketil3.8k
4

I think that trifecta is missing biology and statistics.

written 6.6 years ago by Vince Buffalo460
2

Agreed. The trifecta of bioinformatics: Tools, shell/linux knowledge, and python/perl/other.

written 6.6 years ago by Sequencegeek730

Excellent, that is consistent with what I have seen: instrument vendor tools, followed by common server tools, and some pretty strong sharing in Perl/Python. In less academic environments there are also some commercial tools, for example for generic analysis pipelining.

written 6.6 years ago by Mike Sanders220

I would say LuaJIT offers the best combination of rapid development and high performance. You can write much shorter code while achieving an efficiency close to C, for some tasks.

written 6.6 years ago by lh330k

Python, with the ability to compile to C code via the Cython package (it just requires adding type declarations to variables), seems like a very good choice for high productivity with optional high performance.

written 6.3 years ago by Samuel Lampa1.1k
4
6.6 years ago by
2184687-1231-83-4.8k wrote:

I think Perl combined with the BioPerl toolkit is still very relevant. Python and Biopython are picking up a lot of momentum.

written 6.6 years ago by 2184687-1231-83-4.8k
1

BioPerl continues to be loved! Yet I do like the readability of some of the other tools. Thanks for your help!

written 6.6 years ago by Mike Sanders220
3
6.6 years ago by
United States
Aleksandr Levchuk3.1k wrote:

Many well-respected bioinformatics labs are choosing R. The following institutions have individuals who are deeply dedicated to developing R packages (often with C and C++ inside) for use in bioinformatics scripting:

  • Fred Hutchinson Cancer Research Center
  • Harvard Medical School
  • National Cancer Institute
  • European Molecular Biology Laboratory

Source: http://www.bioconductor.org/about/core-team/

Here is why I think R is so good for scripting in bioinformatics:

  1. Operators and functions are vectorized
      • For example: + in Perl or Python will only work on two objects. In R, c(1,2,3) + 10 produces 11 12 13, which makes more sense in a scientific setting. No need to write for loops, which are super slow in a scripting language. In R the vector operation happens at the C level, so it will zoom through 10 million elements in an instant.
  2. Abundance of high-quality statistical and analytics libraries
      • CRAN has 2983 packages
      • Bioconductor has 460+ packages
  3. Documentation is superb
      • Obligatory for every function in every package
      • Examples are plenty and relevant
  4. It's easy for biologists to use (especially now that there is RStudio)
      • Run commands interactively
      • Produce superb graphs with ggplot2
      • Export results as a spreadsheet
  5. Nothing is a black box
      • Type the function name without the () and you will see the source code
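
For comparison with the c(1,2,3) + 10 example, here is a small sketch of what the same operation looks like in plain Python, using only the standard library. The variable names are illustrative. It shows the contrast being drawn: the + operator does not act element-wise on a built-in list, so the loop has to be written out.

```python
values = [1, 2, 3]

# On a built-in list, + with a number is an error, not an element-wise add.
try:
    values + 10
    plus_scalar_works = True
except TypeError:
    plus_scalar_works = False

# + between two lists concatenates them.
concatenated = values + [10]

# Element-wise addition needs an explicit loop or comprehension.
shifted = [v + 10 for v in values]

print(plus_scalar_works)  # False
print(concatenated)       # [1, 2, 3, 10]
print(shifted)            # [11, 12, 13]
```

The comprehension runs in the interpreter one element at a time, which is the cost the vectorized form avoids.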
modified 6.6 years ago • written 6.6 years ago by Aleksandr Levchuk3.1k
2

Even though I use R often, I disagree. R is by far one of the worst computer languages I've ever seen, and it's also extremely inefficient. Building a language like R would fail you in any computer 101 course. In my case it is the other way around: I'm seeing fewer people using R. It was used a lot when microarray experiments were popular, but not so much for sequencing analysis.

written 6.6 years ago by Pablo1.8k
1

@Pablo, you mentioned that you write code in R. Could you provide some links to your code (e.g. on GitHub)? I want to see how you are using R; that would perhaps explain why you hate it so much.

written 6.6 years ago by Aleksandr Levchuk3.1k
1

When your tasks can be organized into numerical vectors, R is not too bad, but otherwise R is easily the most inefficient scripting language I have ever used, frequently ~10X slower than Python/Perl. In addition to doing statistics, we also process text files and prototype non-numerical algorithms. R fails on both, and probably on more. If one only has time to learn one scripting language, Python is by far the better choice. Its numerical package NumPy is also decent, I believe (I do not use Python).
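
As an illustration of the kind of text-file processing meant here, a minimal Python sketch that reads FASTA-formatted text and reports the GC fraction of each record. It is only a sketch: the function names read_fasta and gc_content and the two example records are made up for this example, and it uses the standard library only.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if it is empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up example input; a real script would iterate over an open file instead.
EXAMPLE = """>seq1 made-up record
ACGT
GGCC
>seq2 made-up record
ATAT
"""

records = list(read_fasta(EXAMPLE.splitlines()))
for header, seq in records:
    print(header, len(seq), gc_content(seq))
```

Because read_fasta is a generator, only one record is held in memory at a time when it is fed a file object.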

written 6.6 years ago by lh330k
1

R is very different. If you are a superb Python, Ruby, or Perl programmer, you are very likely to initially write R code ~10X less efficient in speed, readability, and # of lines. You need to avoid for-loops, use vector operations, do data transformations, use logical vectors. Those are some things that make R a very foreign language.

written 6.6 years ago by Aleksandr Levchuk3.1k
1

Statistical capabilities are certainly an R strength, also its 'high level' abstraction (fit a linear model, rather than calculate sums of squares) and facilities (via packages, vignettes, versions,...) for (more) reproducible research. Volume of data, technical artifacts, experimental design, etc all make our work highly statistical; higher-level reasoning means taking bigger intellectual strides with each line of code; inherent collaboration, especially between folks with disparate areas of expertise, puts a premium on reproducibility.

written 6.6 years ago by Martin Morgan1.5k

PS: I am not saying R is useless. R is very powerful for statistics and plotting. I just think R is not a proper scripting language for general purposes.

written 6.6 years ago by lh330k

Many things cannot be organized into vectors. The significant inefficiency of R is largely due to its crappy implementation. If some capable programmers implemented it from scratch again, it should have similar efficiency to perl/python. Nonetheless, capable programmers would rather choose a language with a decent design instead of R.

written 6.6 years ago by lh330k