Forum: Will Python Take The Place Of R?
12
19
9.1 years ago
Medhat 9.3k

Day after day, more statistics packages are added to the Python ecosystem ("RPy, Statsmodels, StatPy, etc."), making it more dependable and robust in the field of statistics. But will it take the place of R?

By the way, this is not meant as a comparison or a geeky R-vs-Python debate. What I am asking is whether the investment in producing statistics packages for Python will eventually make Python alone, without R, sufficient for bioinformatics needs. Also, should we invest more time in Python's statistics packages, or, as some answers here suggest, will we never get the perfect combination of "scripting + statistics language" in Python?

### Update

The post is old, but today I found these articles, which also help a lot:

Choosing R or Python for data analysis? An infographic

Python overtakes R, becomes the leader in Data Science, Machine Learning platforms

biopython statistics r python Forum • 37k views
3

[R] as a language has its own syntax. Since many people like R's syntax, I doubt that Python will replace [R].

4

OK, but the new generation who did not learn R will not need to learn both R and Python; I think they will just need to learn Python.

5

Python is great, but its functionality (and the functionality of other scripting languages) does not completely overlap with that of [R]. Learn both. If you hate R, learn Octave, Matlab, or Maple. Or you could implement and validate complicated statistical algorithms yourself.

5

I think the last part is bad advice. Do not implement statistical algorithms yourself, not even the ones you think are reasonably straightforward. The libraries that are there have been tested more extensively than anything you write most likely ever will be, i.e. they will have no or very few bugs.

6

Some statistical algorithms are simple. It is sometimes a good idea to reimplement them, both to better understand the pitfalls and to drop unnecessary dependencies. In addition, most 3rd-party libraries are written by fellow programmers no better than you and me. They make mistakes, mistakes that can exist for years because we all take the libraries for granted. My favorite example is matrix multiplication in Ruby and a few other programming languages (e.g. Clay). It is widely known that the second matrix needs to be transposed for better cache efficiency, but in the Ruby library, the developer uses the much slower method. I would think such an obvious flaw should have been identified much earlier, but it is still there because no one bothers to read the source code (EDIT: or because no one implements matrix multiplication these days to learn the transpose trick). Except for a small fraction of really high-quality or widely used libraries, most others are not trustworthy.
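For readers who have not seen the transpose trick, here is a minimal sketch in plain Python. The function names are made up for illustration, and in an interpreted language the cache benefit is largely hidden by interpreter overhead, so this only shows the access pattern, not a benchmark.

```python
def matmul_naive(a, b):
    """Textbook triple loop: the inner loop walks down a column of b,
    touching a different row of b at every step."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_transposed(a, b):
    """Transpose b once, so each inner product walks two rows sequentially."""
    bt = [list(col) for col in zip(*b)]  # bt[j] holds column j of b
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result_naive = matmul_naive(a, b)
result_transposed = matmul_transposed(a, b)
```

Both functions return the same product; only the order in which memory is visited differs, which is what matters in a compiled language with row-major storage.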

2

My personal confession here is that every time I try to implement a statistical method and read more about it, I realize that I do not know how to do it correctly. Or what correct actually means, or whether I should even worry about it. What I mean here is that even the simplest task can become a little scary if one wants to generalize. An example: what is the right way to compute something as simple as a sum?

3

The more we implement, the clearer we are about correctness and the more likely we can apply similar techniques to complex practical problems. For those who work on method development, reimplementing simple methods is very important. In addition, if you do not know how to implement a sum correctly, many others will not know either; some of them may even be unaware of numerical stability altogether.
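To make the "how do you compute a sum" point concrete, here is a small sketch comparing a plain accumulation loop with Kahan (compensated) summation, using `math.fsum` from the standard library as the reference. The helper names are hypothetical, and the loop is written out by hand so the result does not depend on how the built-in `sum` happens to be implemented.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation; rounding error grows with length."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error of each step forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


values = [0.1] * 1_000_000
reference = math.fsum(values)  # correctly rounded sum of the stored floats
naive_error = abs(naive_sum(values) - reference)
kahan_error = abs(kahan_sum(values) - reference)
print(naive_error, kahan_error)
```

Kahan summation is only one of several compensated schemes, and it has its own failure cases, which is rather the point both commenters are making.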

4

I was being cynical :-). Michael Schubert is correct.

1

I really hate R's syntax, but as long as Python doesn't have an equivalent of ggplot2 to create nice plots, I will stick with R for most of the data visualization tasks.

0

The KDNuggets site is so ugly it is surprising one can read an entire article there - anyway, its poll is biased and not of much value, I would think.

And are those charts from the linked article Excel charts??

1

:D for sure it is Excel

1

Excel is the future of bioinformatics

0

R is principally utilized for measurable examination while Python gives a progressively broad way to deal with information science. R and Python are conditions of the workmanship as far as programming language situated towards information science. Learning them two is, obviously, the perfect arrangement.

0

What is measurable examination?

39
9.1 years ago

No, nothing of the sort could ever happen.

1. R has more advanced statistical functionality than Python will ever have - the packages that you list implement a tiny subset of what already exists in R
2. R has better visualization capabilities than Python will ever have
3. R has better cross-platform compatibility than Python will ever have
4. R has better automated package installation than Python has (and likely will ever have)
5. The userbase for R as a statistics language is gigantic compared to the number of users that use Python for data analysis

The downside of R is that it is both eclectic and byzantine.

Python is a generic programming language and it is great at that. But it is not a data analysis platform nor are the lead developers focusing on addressing the issues above.

And I am saying this as someone that uses Python almost exclusively for data analysis and most of my work.

11

1b. Bioconductor: http://www.bioconductor.org/

8

+1 I agree completely with the list, I consider R a must-know language for any bioinformatician.

8

First, right off the bat, when someone is making sweeping statements about the future ("will ever have"), it's a bad sign. It's almost impossible to predict what the future will hold. There's no fundamental reason why Python couldn't be better in every category listed in 10 years; unlikely perhaps, but possible.

Second, this person has a completely different definition of "Python" in the context of "Python vs R" than... well, the rest of the world. From several comments in the original post and in follow-ups, they clearly don't regard pandas, numpy, etc. as part of Python for the purposes of this comparison. This makes no sense, but I won't even go there. The question everyone is interested in is whether the Python numerical/statistical "stack", if you prefer the term, can replace R. So this really seems to be answering a different question.

Third, some of their statements are highly subjective. If 'statistical functionality' is exclusively algorithms produced by the statistics community, then the statement is closer to being true. But R is not the language of choice in machine learning, electrical engineering, physics, etc. departments. All these fields make substantive contributions to statistical functionality - in the case of ML I would argue their contributions to practical statistics are pretty comparable to those of statistics itself. Yeah, R has great visualization. Quick question: how can I zoom in and out of a plot I make in R? Oh right, it's pointlessly difficult... something that's a complete joke in Python and Matlab.

Some reasons Python might take over R:

• No answer from R to ipython, which is probably the best tool for combining code, graphics, and explanation around
• Far better grid computing functionality through ipython and also through sockets and shareable data structures built into Python's standard library
• Far better ability to bind with other languages. My personal favorite is the incredible power of boost.python.
• Rpy2 lets you call any R package directly from Python, possibly largely negating R's only real advantage (its larger library)

While I think it is unlikely that Python will replace R anytime soon (R is just too entrenched in the stats community), I really hope it does. R is a messy, garbage programming language and I'd just as soon never touch it again.

0

As Yogi Berra said: It is difficult to make predictions, especially about the future. Thus when talking about the future one needs to look in a realistic context and not an absolute one.

I for one (and I assume many others) interpret the original question of Python replacing R with respect to the future of a reasonable human timeframe that would affect an individual's work and career in bioinformatics! Say 10 years - that is a very long time especially in Bioinformatics where The Dog Years Of Bioinformatics apply.

Just consider that Python 3.0 was released in 2008 and six years later its penetration into science (and in general) is still extremely low and most specialized software may never (time frame as above) be ported over to it.

Python 2.7 is already more popular than R in absolute terms, and it is more useful than R in countless application domains. Plus I fully agree that R is an old and crufty language with a flawed design. Still, when it comes to biological data analysis, where statistics are an essential requirement, it won't be replaced by Python at any time within our careers.

2

If I may add my testimony from 3 years later: In the last 3 years, I completely switched from python 2 to python 3, I learned a bit of R, used it for a few months, and, given the pain it was to do some programming with it, I finally decided to learn pandas and rpy2 to avoid having to deal with R as much as possible. I would say that the bigger problem is that I still need to use the R interactive interpreter to experiment with bioinformatics packages before managing to run them from rpy2, and also when I want to get help about those packages. I would be much happier if a python version of them was available.

If I had time to learn new languages, I would rather try scala and julia than investing time in R.

3

+1. R is not going anywhere anytime soon.

3

1 and 5 are obvious points, but 2-4 are debatable. R has better native support for stats plots, but that does not mean that all visualisations are better. Mind elaborating on the platform compatibility? IMO, setuptools is far more powerful than R's built-in software installer, but R's repositories are more integrated and centralised. In your comment you do not count numpy as python but Bioconductor is a point for R. I'm not saying that R is not the more obvious choice being a DSL and given the sheer number of packages available (it is!) but the answer seems a bit biased. I would also like to see an example for the "100 vs 1 line", I don't think that's true.

3

From personal experience of teaching Python programming to beginners: installing something as simple as matplotlib causes endless trouble for over half the audience. Some library or some dependency is always missing. Ironically, the only platform on which the installation is seamless is Windows, but even there one needs to make sure to download the library that is labeled with exactly the same version number as the installed Python, plus make sure to download the right binary version, AMD vs Intel. And when it fails, it means that this person has no visualization capability whatsoever.

The equivalent in R is install.packages("foo") and that's it, works the same way on all platforms.

There is a good blog post by Titus Brown: http://ivory.idyll.org/blog/2013-swc-arizona.html that also noted how python software installation is and I quote: "a total disaster". Even though they are using software pre-packaged by a company. The catch is that when you are good at it and notice problems early on it all seems to operate like clockwork.

1

I wouldn't call matplotlib a simple package to install.

Also, in my experience, I have had some trouble with R packages that require external dependencies.

1

I am not as well versed in R as in Python. I can see that the available packages make R indispensable as a statistical language. However, is there anything native to R that makes it ideal for data/statistics? If we are just comparing the R and Python languages, are there syntactical advantages to R?

4

R has a large number of syntactical constructs that one can use to build incredibly powerful (and often maddeningly difficult to understand) expressions. Using "apply", "by" and "sweep" combined with the automatic function broadcasting allows one to write one-liners that do things that would take hundreds of lines of Python to implement. You can do similar things with NumPy or Pandas, but note how that is not Python anymore: one would need to learn the NumPy/Pandas-specific terminology, which is radically different from Python and not all that different from R.
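As a rough illustration of what one such construct packs in, R's `sweep(x, 2, colMeans(x))` subtracts each column's mean from that column. Below is a sketch of the same operation in plain Python without NumPy or Pandas; the helper names are invented for this example, and it only covers this one simple case, not the more elaborate combinations the comment has in mind.

```python
def column_means(rows):
    """Mean of each column of a rectangular list of lists."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sweep_columns(rows, stats):
    """Subtract stats[j] from every entry in column j."""
    return [[value - s for value, s in zip(row, stats)] for row in rows]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centered = sweep_columns(data, column_means(data))
print(centered)
```

The explicit `zip` calls do by hand what R's broadcasting (or NumPy's) does implicitly.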

For an example of R at its best, see the plyr package:

1

linear algebra: syntax for multiplying matrices. Although I am sure there is a python lib for this.
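On the Python side, version 3.5 added a dedicated infix operator, `@`, for matrix multiplication (PEP 465). Any class can opt in by defining `__matmul__`, and NumPy arrays support it. The toy `Matrix` class below is purely hypothetical and exists only to show the syntax without external libraries.

```python
class Matrix:
    """Toy matrix wrapper, for illustration only."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __matmul__(self, other):
        # Called for the expression `self @ other`.
        cols = list(zip(*other.rows))
        return Matrix([[sum(x * y for x, y in zip(r, c)) for c in cols]
                       for r in self.rows])


product = Matrix([[1, 2], [3, 4]]) @ Matrix([[5, 6], [7, 8]])
print(product.rows)
```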

0

Excellent points Istvan.

2

I think 2, 3, and 4 are plain false and 5 is questionable.

0

Interesting that in 2021, every one of those 5 points outlined is not only questionable but entirely and flat out wrong. (2, 3, and 4 were already vastly questionable circa 2016, and 1 probably already around late 2017)... Also very intriguing that no one ever wants to talk about R's terrible compilation problems and its ancient and horrible resource management.

1

Eight years is a lifetime in bioinformatics. That said, R probably still has better visualization options than Python. Plotly is still no replacement, even in 2021, and matplotlib is still incomplete enough that a library had to be written to glue ggplot2 onto it. Python is only usable for science because of the efforts of people behind numpy and scipy; underneath, without those core libraries, it still remains a slow environment for parsing and processing bioinformatics data.

0

Interestingly, I would say that every single point is still valid, as it were, in 2021. And again, I am saying this as someone who works primarily in Python, thus I don't consider myself to be biased towards a language I know far better.

10
9.1 years ago

Many statistics packages (and elaborate pipelines, APIs, etc.) are also still being added to R regularly.

Over the past ten years I have watched R grow from something only statistics professors used to a fairly ubiquitous tool being used by bioinformaticians, biologists, clinicians, etc. It seems to be under active development. R 3.0.0 was just released. It has a GUI version that works nicely in Mac or Windows and the text based session will work in Mac or Linux/Unix. I have encountered surprisingly few cross platform issues over the years.

With the addition of 'Rscript' a while back, it is now trivial to automate R analyses (including figure generation) in a pipeline. The barrier to entry as either an R developer or R user seems to be lower than ever, although R maintains a well deserved reputation for having a steep learning curve.

For all these reasons, I don't see R going away anytime soon...

6
9.1 years ago
bluewoodtree ▴ 60

Python (with NumPy, SciPy, and StatPy) already has a big share among data analysis software. You can find almost equivalent functionality to Matlab/Octave and R; however, those Python tools are still somewhat in their infancy. I mean, R and Matlab/Octave have existed for decades and were originally geared towards those data analysis functions... and of course they have developed over the years to become even better. Python's data analysis capabilities are quite new, and it might take a while until they are on the same level, or become even better.

But I am very optimistic that Python will evolve to be one of the best data analysis packages one day. The Python community is very enthusiastic, creative, and productive, and in my opinion it is just a matter of time. However, I think R and Matlab/Octave will never cease to exist. They will find their niche, just like Fortran & Co.

6
8.2 years ago
Xingyu Yang ▴ 280
Search Bioconductor. I don't think Python has any chance of taking its place.
5
9.1 years ago

If you implement a new bioinformatics/biostatistics algorithm, I think Python gives much more flexibility in programming. It is easier to implement those algorithms in Python since it is a general-purpose language with a nice syntax and lots of useful language constructs. R is pretty much bound to tabular data manipulation, but its base of statistical algorithms is really impressive.

So people often use a combination of R and Python (like here: http://cistrome.org/Cistrome/Cistrome_Project.html): Python for algorithm implementations and input/output manipulation, and R for plotting, running statistics, or Bioconductor packages.

1

totally agree +1

5
8.2 years ago
Carlos Borroto ★ 2.0k

I believe Python will take over. It won't be easy, as there is a lot of convincing to do and there are algorithms to port. The not-so-secret weapons Python has are Pandas and the IPython Notebook.

Take a look at this video introduction and see if you agree with me.

10-minute tour of pandas

Feel free to follow on using the available notebook viewer.

2

Really, I am using both now. Sometimes R becomes annoying, but my conclusion is that they will live side by side :)

1

You have been proven very much correct

4
8.2 years ago
pld 5.0k

I am not sure if anyone else feels this way, but I have always had a serious issue with R. I realize Python was purpose-built with the concept of "least surprise", but R seems to have been designed with the concept of "world's least consistent language". I've never had such a wrestling match getting on top of a syntax, especially with Bioconductor, where it seems like every package is a separate beast with its own syntax. At least to me, it seems like I get one Bioconductor package down and that information serves no purpose in figuring out the next one. Vanilla R is one thing, but it seems like the groups and people who implemented their respective R packages made no effort to implement any sort of uniformity or consistency. I've seen some of the R code some statisticians write and it would make even a CS101 student cringe.

If anything it is a case of the typical traditionalism seen in science that slows the adoption and migration to more effective methods. R is a big ol' kludge that people keep using because the last group used it.

I could be wrong but I've never had this experience with anything from C to Perl to Haskell.

Python's scientific computing libraries (message passing, statistics, numerical methods, etc.) are constantly growing, but they're still a long way away from matching the level of features that R offers. SciPy and NumPy are great, but it seems like growth there has slowed. They're also fairly bare-bones for doing some of the more complex stuff. NumPy is very fast, and I'm really happy to see the use of vectorized matrix and vector operations. IMO BioPython is an overweight mess, but that is more about gripes with overzealous OOP.

2

Bioconductor is an "organic" product that is a bit like a field tended by hundreds of gardeners, each with their own plot, going on their merry way. Some plots flourish; many others are abandoned or produce incorrect or inconsistent results, and there is little indication when that happens.

Knowing R/Bioconductor is more about knowing what does not work and steering away from the latter. If I try to rely mostly on R, after a week or so I get mad - it just does not fit my brain, and the inconsistencies and lost opportunities irk me to no end. My defensive mechanism is to use as little R as possible and do all programming in Python. Once the data is in a simple matrix-like structure, things work well.

3
8.2 years ago
Chris Fields ★ 2.2k

I don't think any one language will completely dominate, ever. There are too many people with too many varying opinions on what makes a programming language good or bad. It's akin to always choosing French over German cuisine, Buffalo Trace over Bulleit, vi over emacs, etc. It's a personal, almost artistic choice.

And really, why would you want to know only one language anyway, particularly in this field (which has a lot of Python, Perl, R, Java, JS, C, C++, Ruby, etc. floating around)? Sure, one may have a favorite, but I would find it incredibly boring if there was only one choice.

3
8.2 years ago
Ann ★ 2.3k

No, because of what Istvan said.

And also my own experience:

I used to use python all the time for just about everything. But then I got much better at R programming after reading lots of Bioconductor vignettes and also Bioconductor Case Studies. Once I learned how to use named vectors and lists like dictionaries, I pretty much stopped using python for anything except scripting. And then I learned how to use bash shell scripting more effectively and stopped using python even for that. That said, whenever I want to write a simple command line tool, I always write it in python because I can count on python being available on most systems.

However now that I've heard about the iPython notebook, I might start using python more often.

3
7.1 years ago

No, at least not for the foreseeable future. I echo the sentiments and good points of Ann, Istvan, Malachi, and many others on this thread.

The chief reason is the ecosystem that exists in R, not just for data analysis, but also for statistically-informed analysis of biological datasets. Python can't match what R has in place, which includes numerous packages that are being produced daily. The best example of this is the Bioconductor project, which is extremely well maintained and offers a number of biologically-informed data structures.

The second reason, which hasn't been touched on as much here, is R's capabilities for data visualization. R is unquestionably superior to Python in dataviz, thanks to packages like lattice and especially ggplot2. I don't see R losing its lead to Python in this area either.

2
4.7 years ago

I disagree with the entire concept of even comparing these two. They both have advantages in different areas. A skilled and experienced analyst will know the situations in which each is best applied. I will even throw Java, Perl, and C/C++ into the fray here - all excel in certain areas, and the ideal situation is to have skills in all, which only comes with years of experience.

1

That's all very well*, but it's hardly going to keep the flamewars burning

* and I completely agree

1

I'm going to throw a whole lot of fuel into this forest fire here -- even Excel has its place in a bioinformatician's toolbelt.

1

If you use it, be very careful: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80

When it isn't irreversibly corrupting datasets, Microsoft software also pollutes text files with line endings that trip up casual users of command-line tools.
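The line-ending problem, at least, is easy to repair after the fact. A minimal sketch, assuming the file fits in memory and is handled as raw bytes; the function name and the sample table are made up for illustration.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and lone CR line endings to Unix (LF)."""
    # Replace CRLF first so a CRLF pair does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up sample table saved with Windows line endings.
sample = b"gene\tcount\r\nGENE_A\t12\r\nGENE_B\t7\r\n"
cleaned = normalize_newlines(sample)
print(cleaned)
```

This fixes only the line endings; values that a spreadsheet has already rewritten cannot be recovered this way.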

1

One of my fav articles to point to, too. Bioinformatics is the knowledge of how to use various tools combined with the experience of when not to use them.

2
4.6 years ago

Well, years after the original question, I think this is worth a revisit. While I agree with Kevin's answer fully, it's interesting to see how Python has progressed in recent years to catch up with R on the statistical and visualization side of things. As others have mentioned, numpy, pandas, and scipy give Python a huge amount of flexibility in terms of data manipulation and statistical analysis. It's true that it doesn't have many of the purely 'omics-based packages that R does, but more and more are being ported to Python.

Really, the languages complement each other very nicely, in my opinion. Data munging and handling common file formats is easy in python, particularly with pysam, pybedtools, pyvcf, and pandas, and Rpy2 allows you to access those powerful R stats/modeling packages from within python.

I feel confident in saying python matches R's visualization capabilities at this point in time. I've never had a moment where I felt I had to go to R to create the figure I want. The creation of seaborn and plot.ly allow the creation of high-quality, interactive figures very easily without having to fiddle with matplotlib parameters much (if at all). Couple these with ipython and you've got some really interesting ways to explain and interactively wade through your data.

Python package installation has also come a long way with the advent of pip, anaconda, and bioconda. Similar to R, nearly every Python package is a one-line install. This seemed to be a big complaint of many people 4 years ago, but it's been largely resolved now.

I don't see R going anywhere due to all of the packages made specifically to handle analysis of sequencing experiments, network interactions, etc. Python can do those things perfectly well, but why reinvent the wheel when you can pull it out and stick it on your own car whenever you need?

Overall, I feel Python has established itself as an important player in bioinformatics for years to come. Part of this is due to its incredibly easy-to-pick-up syntax, general flexibility, and extremely active developer base. I personally hate R as a language, but there's no denying its status as the backbone of statistical analysis in the bioinformatics community. Of course, that doesn't mean we have to interact with it any more than is necessary, or that Python won't continue to make advances to help bridge the gap.

1

I still think it's a matter of personal choice. No matter how much Python may "progress", so long as it uses the horrible white-space sensitive syntax, it will remain a horror to stubborn old school C-style programmers such as myself. Unless there is something worth the switch, python would not be my first choice for anything even moderately heavy-weight.

1

As a Java developer, I found it horrible at first glance, without semicolons and curly braces, but after a while I adapted (though I still don't like the idea of spaces) and I was impressed by the productivity level.

0

I mean, I can program in Python alright. I just prefer R because I don't have to jump through hoops to unwrap a dataset.

1

Oh, it definitely is. No reason to limit the toolkit. It's funny you mention the syntax, as I find bracket-based syntax infinitely more annoying visually and while writing. I get the advantages, but it just causes headaches for me.

I actually agree with your last point, I just find python a lot simpler to write. You can actually get the best of both worlds if you wrap C libraries and use cython to speed up the intensive tasks. R does the same thing with Rcpp. Things like the boost.python library and SWIG make it easy to wrap existing C code with python - see pysam/pybedtools. Plus pandas/numpy do quite well performance-wise. Again, just depends on preference. I know that learning how to do such things in python will take much less time than my trying to learn the idiosyncrasies of C/Java, so it's what I tend to stick with.

1

True - ease of learning is what gives the edge to Python. It's the reason why I have it on my tool belt too - learning it felt like a minimal effort thing. I am not against switching to Python, I just don't see the exclusive need for it, and I cannot actively find reasons to switch to it because of the barrier I mentioned.