I will be giving introductory courses on Linux, Bash, Python, R, and a bit of Perl, among others, in the coming semesters. The courses are aimed at graduate students doing genomics or population genetics research projects.
I will start the courses with a general overview of what Linux is about, but then I need to be able to stress why it could be important to learn Linux and programming skills, so that the students gain more independence when doing their genomics analyses.
So, I would greatly appreciate your input on the following question:
> Why should someone doing a genomics project ever want to learn Linux?
I have quite a few philosophical, technical and practical justifications in mind, but I would like to know what your opinion is. You can also tell me why you think it is non-essential if this is your opinion.
simple to install and use software development tools (gcc, g++, Python, Perl)
On Linux they are all installed and configured with one click.
multiple versions of a program can be installed by users themselves and switched on or off by sourcing a script, without administrator rights. On Windows I always had to change the path in a very, very small text field that took about four clicks to reach.
a lot of good scientific software is written in a non-portable way for Linux/Unix (almost all short read aligners, samtools). This makes it necessary to use Unix for genomics.
X windows: work on a powerful server and have the GUI on your thin client
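To illustrate the point above about switching versions by sourcing a script, here is a minimal sketch. It is not a recipe from this answer: the tool name `mytool`, the version numbers and the `use-mytool-*.sh` activation scripts are all made up for the example. Real setups usually do more, but the core idea is only to put one directory first on `PATH`, which needs no administrator rights.

```shell
# Hypothetical layout: two versions of a made-up tool in a scratch directory.
workdir=$(mktemp -d)
old_path=$PATH

for version in 1.0 2.0; do
    mkdir -p "$workdir/mytool-$version/bin"
    # Fake executable that only reports its version.
    printf '#!/bin/sh\necho "mytool %s"\n' "$version" > "$workdir/mytool-$version/bin/mytool"
    chmod +x "$workdir/mytool-$version/bin/mytool"
    # Activation script: put this version first on PATH.
    printf 'export PATH="%s/mytool-%s/bin:$PATH"\n' "$workdir" "$version" > "$workdir/use-mytool-$version.sh"
done

# "Switch on" version 1.0, then ask the shell which mytool it would run.
. "$workdir/use-mytool-1.0.sh"
first=$(command -v mytool)

# Switch to version 2.0 the same way.
. "$workdir/use-mytool-2.0.sh"
second=$(command -v mytool)

echo "after sourcing 1.0: $first"
echo "after sourcing 2.0: $second"

# Clean up the scratch directory and restore the original PATH.
PATH=$old_path
rm -r "$workdir"
```

Because the whole switch is one sourced line, it can live in a project directory and be shared with collaborators.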
> Why should someone doing a genomics project ever want to learn Linux?
Put simply, using anything else hinders your research and gives competitors using UNIX a distinct advantage. Without question, the best tools available in this field are open source tools that are largely written for POSIX systems. Yes, you can adapt these tools to Windows environments with Cygwin/VMware, but part of being a scientist is knowing what the best equipment is for the experiment at hand.
The way I would formulate this is that Unix-like systems were designed to operate via action words that can be chained into 'sentences', whereas graphical operating systems like Windows present actions as fixed tasks that are easy to discover (right click shows them all) but cannot be easily shared, repeated, modified or chained into more complex tasks.
Data analysis in general (and bioinformatics in particular) is a domain where we need to express our goals in very detailed and nuanced ways, and we need the type of functionality that a GUI-based system lacks.
The statements above apply in general to other GUI vs command line discussions as well.
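As a small sketch of what such a 'sentence' looks like, the pipeline below chains five standard tools to answer "how many SNPs are there per chromosome?". The file `variants.tsv` and its three-column layout (chromosome, position, variant type) are toy data invented for the example, not a real format.

```shell
# Hypothetical toy data: chromosome, position, variant type (tab-separated).
workdir=$(mktemp -d)
printf 'chr1\t100\tSNP\nchr1\t250\tINDEL\nchr2\t300\tSNP\nchr1\t900\tSNP\nchr1\t1200\tSNP\n' > "$workdir/variants.tsv"

# One 'sentence' built from small action words:
#   grep    keep only the rows that mention SNP as a whole word
#   cut     keep the first column (the chromosome)
#   sort    group identical chromosomes together
#   uniq -c count each group
#   awk     print "chromosome<TAB>count"
snp_counts=$(grep -w 'SNP' "$workdir/variants.tsv" \
    | cut -f1 \
    | sort \
    | uniq -c \
    | awk '{print $2 "\t" $1}')

printf '%s\n' "$snp_counts"

rm -r "$workdir"
```

Changing the question (say, counting INDELs instead) means editing one word, and the whole line can be pasted into an email or a lab notebook, which is exactly what a sequence of mouse clicks cannot offer.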
One reason people use Linux is that there is an abundance of programs and libraries for bioinformatics written for Linux, like the EMBOSS suite and BLAST. Linux gives users complete control over their system, so it is easy to extend existing software for new uses. Another benefit of using Linux is access to Bash, a very powerful command-line shell that can be used to create pipelines of multiple programs and their outputs. However, using Linux is not essential, as any Unix-based system (OS X, for example) will operate in a similar way.
Linux-based systems are the operating systems of choice when it comes to remote computing. You can easily give commands via ssh. Remote file systems can be mounted via sshfs/samba/ftp so that they appear as local drives. Forwarding the X server allows you to continue your work from any (Linux) computer.
These points are even more important for distributed computing. Most cluster software is Linux-centred, and I believe developing tools for MPI (for instance) is best done in the target environment, namely Linux.
Furthermore, remote help also works in an offline way. Surely you can remember the last time you gave some advice to a Windows-using family member. The typical desperate telephone hotline: Click here, click there, click on Options - oh, there is no field named Options. What is the last entry? Quit? No, the other one. Configure?... And so on, and so forth. In Linux you would just do a ./whatever -o thathelps
PS: I love all the small tools for Linux which make life so much easier (grep, cat, text processing, file conversions, batch jobs and piping, ...). Starting out in Linux was quite hard, but very soon my productivity was much higher than in Windows.
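A rough sketch of the kind of batch job meant here: the loop below counts the sequences in every FASTA file in a directory. The file names and sequences are made up for the example, and it only assumes that each sequence header starts with `>`.

```shell
# Hypothetical toy input: two small FASTA files in a scratch directory.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/a.fasta"
printf '>seq3\nTTAA\n' > "$workdir/b.fasta"

summary=""
# Batch job: run the same small command on every matching file.
for fasta in "$workdir"/*.fasta; do
    # grep -c counts the header lines, i.e. the number of sequences.
    n=$(grep -c '^>' "$fasta")
    summary="$summary$(basename "$fasta"):$n "
done

echo "$summary"

rm -r "$workdir"
```

The same loop works unchanged for two files or two thousand.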
You may also mention how difficult it is to write high-performance programs for Windows. Surely it can be done, but it is sort of a nightmare, especially for C programmers (C++ is better supported in Windows). In addition, a few years ago, some core library routines, such as memory management, were substandard in comparison to their Linux equivalents. This is why most high-performance programs only work in Linux.
It is interesting how no one points out how much of an active choice it is to decide to learn Linux, compared to trying to stick to the old OS you were born with. There are of course some essential scientific reasons to use it, and these have already been laid out here. But discovering the open-source world is, for me, one of the most important rewards in learning Linux. Additionally, the open-source model clearly corresponds to the scientific approach: sharing results and methods for others to build on, in order for them to develop their own results and methods.
Discovering this might lead you to:
develop the reproducibility of your experiments (automatic pipelines with all your custom parameters)
contribute to and develop open-source tools for the community
contribute to the spreading of open-source tools in the community
learn how to take the best advantage of your hardware by controlling your OS
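The first point in the list, reproducibility through automatic pipelines with your custom parameters, can be sketched as follows. Everything specific here is an assumption made for illustration: the sample name, the `min_quality` threshold, the two-column input (read name, quality score) and the `run.log` format.

```shell
# Hypothetical parameters for an illustrative filtering step.
input_name="sample_A"
min_quality=30

# Hypothetical toy input: read name and quality score (tab-separated).
workdir=$(mktemp -d)
printf 'read1\t35\nread2\t12\nread3\t40\n' > "$workdir/$input_name.tsv"

# Record every parameter before running, so the run can be repeated exactly.
run_log="$workdir/run.log"
{
    echo "input=$input_name"
    echo "min_quality=$min_quality"
} > "$run_log"

# The analysis step itself: keep reads at or above the threshold.
awk -F '\t' -v q="$min_quality" '$2 >= q' "$workdir/$input_name.tsv" \
    > "$workdir/$input_name.filtered.tsv"

# Record the outcome next to the parameters that produced it.
kept=$(wc -l < "$workdir/$input_name.filtered.tsv" | tr -d ' ')
echo "kept=$kept" >> "$run_log"

log_contents=$(cat "$run_log")
printf '%s\n' "$log_contents"

rm -r "$workdir"
```

Keeping the parameters in variables at the top and writing them to a log means the script itself documents how a result was obtained.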
I will end by saying that taking the time to learn Linux has been by far one of my best 'career' choices.
Bioinformatics algorithms are often run on server farms ("in the cloud") for high-MIPS processing. Writing applications to run on such servers is easier on a POSIX system.
I would recommend your students use OS X because:
it is the world's most popular end-user Unix system and has the highest ease of use, especially for non-computer scientists.
OS X is built on BSD, the most recognized Unix in professional server environments.
Because OS X is a commercial mass-market Unix, the most frequent problem of open source Unix (Linux or BSD) is avoided: the system hardware is 100% supported by the software without any device driver problems.
more of their bio peers will be using OS X (check the counts in the audience when at a conference).
there is no such thing as "running Linux", only "running a GNU/Linux distribution", and the choice of distribution is a big decision in itself, with limitations and fragmented user bases for each. Arch vs Gentoo vs Ubuntu vs Red Hat Enterprise vs Fedora vs SUSE vs just forget it.
Even after choosing the GNU/Linux distribution, there is still the big decision of which desktop environment to choose, again with fragmented user bases. GNOME vs KDE vs Xfce vs etc etc etc just forget it.
I am for sure going to get negative votes for the above opinion (and likely some might rail -- incorrectly -- about the financial costs of "free" vs "paid" unix)
"Because that's the tool you're giving them."
Figure out for yourself whether you're teaching about bioinformatics tools or if you're doing Linux advocacy.
I'm a Unix user myself (OpenBSD in my case), but I would never put a Unix box in the hands of someone who is more proficient with Windows than with Unix, unless there was some other reason for them to be using Unix (tools, funding, ability to collaborate etc.).
Also, I wonder if there's anything called "Linux programming skills"? Perl, Python, C, C++ and most other languages can be programmed on most types of operating systems.
Do you have to justify the topic and methods you choose? I would try to avoid starting the prototypical flame war in the lecture. By comparison, in a lecture on protein structure you would not have to explain why you present Photosystem I but not an RNA polymerase... I would say: you decide, you are the boss, and basta.
@cjt The courses are about bioinformatics, but I am going to spend a lot of time on teaching Linux and everything will be done in Linux afterward. I feel compelled to justify such a choice since it does require quite an effort on the part of the students. Cheers
Linux specifically, or UNIX more broadly?
@Andreas I'm not trying to do Linux advocacy, except that it seems to me the only option that is powerful enough... I also did not talk about 'Linux programming skill'. I mentioned 'Linux AND programming skill'. Cheers
@Casey. Let's say that 'UNIX compatible systems' are great because the UNIX philosophy was great, but if I expect people to be able to use a UNIX compatible system quickly, I'm sure going to go for Linux using an Ubuntu distro. My objective is very practical. I'm not trying to turn anybody into a hardcore UNIX geek, only to give them powerful and flexible tools and teach them how to use them for advancing their projects.
Ah, sorry for misreading.
If it is the only viable option that you can see, it implies that you already know how to justify it.
@Andreas. Well, more or less. As I mentioned, I do see reasons, but when you fall in love with Linux, the big reason that matters in the end is loving your job every day just because you can use bash/Python/Perl etc. :) I can't expect them to understand that feeling from the start. I also don't want my argument to sound like that. That is why I am appealing to the resourcefulness of this forum :)
Sjeez, and all these answers and comments are from the same people that complain that a basic R question is not bioinformatics (sorry, couldn't resist).
Thanks a lot, people! I have to award the accepted answer to somebody, so I am giving it to the most popular answer. But keep the suggestions coming! Cheers