Justifying Learning Linux For Bioinformatics
12
21
11.6 years ago

Hi,

I will be giving introductory courses for Linux, bash, Python, R, a bit of Perl, among others, in the coming semesters. The courses are aimed at graduate students doing genomics or population genetics research projects.

I will start the courses with a general overview of what Linux is about, but then I need to be able to stress why it could be important to learn Linux and programming skills so that the students gain more independence when doing their genomics analyses.

So, I would greatly appreciate your input on the following question:

> Why should someone doing a genomics project ever want to learn Linux?

I have quite a few philosophical, technical and practical justifications in mind, but I would like to know what your opinion is. You can also tell me why you think it is non-essential if this is your opinion.

Cheers

linux subjective • 14k views
2

"Because that's the tool you're giving them."

Figure out for yourself whether you're teaching about bioinformatics tools or if you're doing Linux advocacy.

I'm a Unix user myself (OpenBSD in my case), but I would never put a Unix box in the hands of someone who is more proficient with Windows than with Unix, unless there was some other reason for them to be using Unix (tools, funding, ability to collaborate etc.).

Also, I wonder if there's anything called "Linux programming skills"? Perl, Python, C, C++ and most other languages can be programmed on most types of operating systems.

2

Do you have to justify the topic and methods you choose? I would try to avoid starting the prototypical flame war in the lecture. By comparison, in a lecture on protein structure you would not have to explain why you present Photosystem I but not an RNA polymerase... I would say: you decide, you are the boss, and basta.

1

@cjt The courses are about bioinformatics, but I am going to spend a lot of time on teaching Linux and everything will be done in Linux afterward. I feel compelled to justify such a choice since it does require quite an effort on the part of the students. Cheers

0

Linux specifically, or UNIX more broadly?

0

@Andreas I'm not trying to do Linux advocacy, except that it seems the only powerful enough option to me... I also did not talk about 'Linux programming skill'. I mentioned 'Linux AND programming skill' Cheers

0

@Casey. Let's say that 'UNIX compatible systems' are great because the UNIX philosophy was great, but if I expect people to be able to use a UNIX compatible system quickly, I'm sure going to go for Linux using an Ubuntu distro. My objective is very practical. I'm not trying to turn anybody into a hardcore UNIX geek, only to give them powerful and flexible tools and teach them how to use them for advancing their projects.

0

If it is the only viable option that you can see, it implies that you already know how to justify it.

0

@Andreas. Well, more or less. As I mentioned, I do see reasons, but when you fall in love with Linux, the big reason that matters in the end is loving to do your job every day just because you can use bash/Python/Perl etc. :) I can't expect them to understand that feeling from the start. I also don't want my argument to sound like that. That is why I am appealing to the resourcefulness of this forum :)

0

Sjeez and all these answers and comments are from the same people that complain that a basic R question is not bioinformatics (sorry couldn't resist).

0

Thanks a lot people! I have to give the right answer to somebody so I give it to the most popular answer. But keep the suggestions coming! Cheers

22
11.6 years ago
Ido Tamir 5.2k

Linux is not important, Unix is.

Unix (Linux) is important because of:

• Unix text tools and editors (vim, emacs, etc.): one often works with text files and always has to peek at them a little (head, tail), mangle them (sort, cut, paste), etc.
• easy construction of simple pipelines (awk, bash, piping, bash redirection, text tools)
• software development tools (gcc, g++, python, perl) that are simple to install and use. On Linux they are all installed and configured with one click.
• multiple versions of a program, which users can install themselves and switch on/off by sourcing some scripts, without being administrator. On Windows I always had to change the path in a very, very small text field that took about 4 clicks to reach.
• a lot of good scientific software written in a non-portable way for Linux/Unix (almost all short read aligners, samtools). This makes it necessary to use Unix for genomics.
• X windows: work on a powerful server and have the GUI on your thin client
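
To make the first two bullets concrete, here is a minimal sketch of peeking at a text file and chaining small tools into a pipeline. The file genes.tsv and everything in it are made up for illustration; any tab-separated table would do.

```shell
# Build a tiny tab-separated table (hypothetical data standing in for a real annotation file).
printf 'gene\tchrom\tlength\n' > genes.tsv
printf 'geneA\tchr1\t1200\ngeneB\tchr2\t800\ngeneC\tchr1\t450\ngeneD\tchr3\t3000\n' >> genes.tsv

# Peek at the top of the file without opening it in an editor.
head -n 2 genes.tsv

# Chain small tools: drop the header, keep the chromosome column,
# count genes per chromosome, and list the most frequent chromosome first.
tail -n +2 genes.tsv | cut -f 2 | sort | uniq -c | sort -k1,1nr > genes_per_chrom.txt
cat genes_per_chrom.txt

# Same idea, different question: which gene is the longest?
tab=$(printf '\t')
longest=$(tail -n +2 genes.tsv | sort -t "$tab" -k3,3nr | head -n 1 | cut -f 1)
echo "longest gene: $longest"
```

Each step does one small thing, and the pipe hands its output to the next step, so the same building blocks can be rearranged for a different question without writing a new program.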
1

Thank you @Ido! Nice list, I will use most of that in my justification :)

1

+1 for installation/setup: package managers are maybe the reason why Linux is so much easier to use. I don't agree with the bug-ridden bloatware X windows though ;-)

0

X windows might be bloated, but I read that last bullet as more of a network model/multi-user capability. e.g. since UNIX machines are built with multi-user capability in mind, you can log in to large or small machines, one or many (assuming you have access to them), and accomplish things as needed. I do most of my work remotely through shell windows. I use giant computers even though they're not sitting on my desk. My jobs continue running after I disconnect.
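
The "jobs continue running after I disconnect" part can be sketched with nohup. This is only a local stand-in: the loop below is a hypothetical placeholder for a real analysis, and on a remote machine you would typically start it after logging in over ssh.

```shell
# Hypothetical stand-in for a long analysis: a short loop that writes progress lines.
# nohup makes the process ignore the hangup signal sent when the terminal closes,
# and the trailing & puts it in the background.
nohup sh -c 'for i in 1 2 3; do echo "step $i"; done' > job.log 2>&1 < /dev/null &
job_pid=$!
echo "started job with PID $job_pid"

# You could log out here and the job would keep going.
# For this demo we simply wait for it to finish and then read the log.
wait "$job_pid"
cat job.log
```

Terminal multiplexers such as screen or tmux are a common alternative when you also want to reattach to the running session later.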

14
11.6 years ago
brentp 24k

Give them a 4GB SAM/FASTQ file and have them (try to) open it in Windows.

Then have them (try to) create a new file from that with only the reads from chromosome 1.

Then do that in linux in 1 line with grep or whatever.

Then say "I rest my case".
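
For the record, a minimal sketch of what that one line could look like. It assumes a plain-text SAM file in which column 3 holds the reference name; example.sam and its reads are invented here to keep the demo small. awk is used rather than a bare grep so that chr10 is not mistaken for chr1.

```shell
# Tiny stand-in for a multi-gigabyte SAM file (hypothetical reads).
{
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf '@SQ\tSN:chr10\tLN:1000\n'
    printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t0\tchr10\t200\t60\t4M\t*\t0\t0\tTTGA\tIIII\n'
    printf 'read3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tGGCA\tIIII\n'
} > example.sam

# The actual one-liner: keep header lines (starting with @) plus reads
# whose reference name (column 3) is exactly chr1.
awk -F '\t' '/^@/ || $3 == "chr1"' example.sam > chr1.sam

# Count the reads that made it through (non-header lines).
grep -vc '^@' chr1.sam
```

Because awk reads the file line by line, the same command works unchanged on a 4GB file without loading it into memory.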

0

Hi Brent. Toying with huge files is certainly one reason I think Linux is far superior (at least to W!nd0w5). I'll get a few examples of that type set up (data extraction, counting sequences...) to show them right away what POWER is about :P
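
A first sketch of the "counting sequences" example I have in mind. The two files are made-up miniatures, and the FASTQ count assumes the usual four lines per record with no wrapped sequence lines.

```shell
# Made-up miniature files standing in for real sequencing data.
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAC\n' > example.fasta
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > example.fastq

# FASTA: every record starts with a ">" header line, so count those.
fasta_count=$(grep -c '^>' example.fasta)

# FASTQ: assuming four lines per record, divide the line count by 4.
# (Counting lines that start with "@" would be wrong, since quality
# strings can also begin with "@".)
fastq_count=$(( $(wc -l < example.fastq) / 4 ))

echo "FASTA records: $fasta_count"
echo "FASTQ records: $fastq_count"
```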

0

I'm going to use this with my fellow "wet lab" colleagues who insist on Windows. evil grin

14
11.6 years ago

> Why should someone doing a genomics project ever want to learn Linux?

Put simply, using anything else hinders your research and provides competitors using UNIX a distinct advantage. Without question, the best tools available in this field are open source tools that are largely written for POSIX systems. Yes, you can adapt these tools to Windows environments with Cygwin/VMWare, but part of being a scientist is knowing what the best equipment is for the experiment at hand.

UNIX is the best equipment.

1

Nice pitch @Aaron! Do I have the permission to quote your reply directly in class? :)

0

Sure, just correct my gammar and poor diction!

0

above=case in point.

11
11.6 years ago

Why linux for Bioinformatics ?

because "put in a database the 10 first ordered sequences from the 100 last records about rotavirus at NCBI, but not containing the word VP7 " is as simple as:

curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$(xmllint --format "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=rotavirus[ORG]&retmax=100" | grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1 | tr "\n" ",")&rettype=fasta" | awk '/^>/ { printf("\n%s\t",$0);next;} {printf("%s",$0);}' | grep -v VP7 | sort -t $'\t' -k 2,2 | head | awk -F '\t' 'BEGIN {printf("create table if not exists DNA(name,seq) ;\n");} {printf("insert into DNA(name,seq) values (\"%s\",\"%s\");\n",$1,$2); }' | sqlite3 biostar.sqlite

because:

head mybigfile.xls

because:

wc youdonthavetoopenexcel.txt

(...) because:

echo SSBoYXRlIHdpbmRvd3MK | base64 -d

2

I still think if you close with "I rest my case" (and a smug look) this will work better. :)

2

echo TXkgZnJpZW5kIGFuZCBJIHVzZWQgdG8gc2VuZCBtZXNzYWdlcyB0aGlzIHdheSBpbiBjb2xsZWdlLg0K | base64 -d

0

I guess showing off is going to be part of the process :P

0

Will do :) But I don't want to scare them either because it looks like black magic! o_o

0

@Pierre SSBoYXRlIFdpbmRvd3MgYnV0IGFsc28gTWFjCg== :P

0

"Yvahk znxrf pbzchgvat sha ntnva :)"

0

ab, Revp, gung'f ebg13 - bayl hfrq ol trbpnpuref ;)

0

Qnza, jnagrq gb guebj va fbzr ebg13 whfg gb frr fbzrbar unq gur vqrn nyernql :-)

0

But that's what I call a nice explanation of ebg13: http://uncyclopedia.wikia.com/wiki/EBG13

8
11.6 years ago

The way I would formulate this is that Unix-like systems were designed to operate via action words that can be chained into 'sentences', whereas graphical operating systems like Windows present actions as fixed tasks that are easy to discover (right click shows them all) but cannot be easily shared, repeated, modified or chained into more complex tasks.

Data analysis in general (and bioinformatics in particular) are domains where we need to express our goals in very detailed and nuanced ways, and we need the type of functionality that a GUI-based system lacks.

The statements above apply in general to other GUI vs command line discussions as well.

0

Thank you @Istvan. I like the 'words and sentences' analogy. I think I'll incorporate this to make the students understand part of the 'UNIX way'. Cheers

6
11.6 years ago

Following on from @Ido Tamir's list, knowing Linux allows a genomicist to:

• develop a transferable skill set that sets you apart from a wet-only biologist (having *NIX skills on your CV is an asset in the post-genomic research world)
• better understand how computers and operating systems actually work
• ability to run bioinformatics resources on your own machine (BLAST, GALAXY, etc)
• ability to access ready-made bioinformatics computing environments (e.g. Bio-Linux)
• ability to do reproducible research (BASH, R, TAVERNA, etc.)
• ability to perform analyses on computer clusters (important for big/long computational jobs)
• ability to access cloud computing resources (increasingly important for groups without access to HPC infrastructure)

0

Many thanks @Casey, I really like your take on this. I'll try to emphasize that there are skills that the courses will bring to them and that can be transferred/applied in their future research career. Cheers

4
11.6 years ago
Burlappsack ▴ 680

One reason people use Linux is that there is an abundance of programs and libraries for bioinformatics written for Linux, like the EMBOSS suite and BLAST. Linux gives users complete control over their system, so it is easy to extend existing software for new uses. Another benefit of using Linux is access to Bash, a very powerful command line that can be used to create pipelines of multiple programs and their outputs. However, using Linux is not essential, as any Unix-based system (e.g. OS X) will operate in a similar way.

1

Bad examples - both the EMBOSS suite and NCBI BLAST are also available on Windows (and Mac OS X).

0

Thank you, these are indeed important reasons! For OS X and others, I guess learning Linux is an advantage then. The material the students assimilate is going to be directly transferable to their Mac boxes and they will have learned about UNIX/Linux on the way!

0

Ok, noted @Peter, thanks for the correction.

4
11.6 years ago
Cjt ▴ 370

Linux-based systems are the operating systems of choice when it comes to remote computing. You can easily give commands via ssh. Remote file systems can be mounted into the system via sshfs/samba/ftp so that they appear as local drives. Forwarding the X server allows you to continue your work from any (Linux) computer.

These points are even more important for distributed computing. Most cluster software is Linux-centred and I believe developing tools for MPI (for instance) is best done in the target environment, namely Linux.

Furthermore, the remote access also works in an offline way. Surely you can remember the last time you gave some advice to a Windows-using family member. The typical desperate telephone hotline: Click here, click there, click on Options - oh, there is no field named Options. What is the last entry? Quit? No, the other one. Configure?... And so on, and so forth. In Linux you would just do a

./whatever -o thathelps

PS: I love all the small tools for Linux which make life so much easier (grep, cat, text processing, file conversions, batch jobs and piping,..). Starting in Linux was quite hard at the beginning, but very soon my productivity was much higher than in Windows.

3
11.6 years ago
lh3 33k

You may also mention how difficult it is to write high-performance programs for Windows. Surely it can be done, but it is sort of a nightmare, especially for C programmers (C++ is better supported in Windows). In addition, a few years ago, some core library routines, such as memory management, were substandard in comparison to the Linux equivalents. This is why most high-performance programs only work in Linux.

3
11.6 years ago

It is interesting how no one points out how much of an active choice it is to decide to learn Linux compared to trying to stick to the old OS you were born with.

There are of course some essential scientific reasons to be using it, and these have already been laid out here. But discovering the open-source world is, for me, one of the most important rewards of learning Linux.

Additionally, the open-source model clearly corresponds to the scientific approach: sharing results and methods for others to build on, in order for them to develop their own results and methods.

Discovering this might lead you to:

• develop the reproducibility of your experiments (automatic pipelines with all your custom parameters)
• contribute to and develop open-source tools for the community
• contribute to the spreading of open-source tools in the community
• learn how to take the best advantage of your hardware by controlling your OS

I will end by saying that taking the time to learn Linux has been by far one of my best 'career' choices.

3
11.6 years ago

Bioinformatics algorithms are often run on server farms ("in the cloud") for high-MIPS processing. Writing applications to run on such servers is easier on a POSIX system.

I would recommend your students use OS/X because:

• it is the world's most popular end-user Unix system and has the highest ease of use, especially for non-computer scientists.
• OS/X is written on BSD, the most recognized unix kernel in professional server environments.
• Because OS/X is a commercial mass-market unix, the most frequent problem of open source Unix (Linux or BSD) is avoided: the system hardware is 100% supported by the software without any device driver problems.
• more of their bio peers will be using OS/X (check the counts in the audience when at a conference).
• there is no such thing as "running Linux", only "running a GNU/Linux Distribution", and the choice of distribution is a big decision itself with limitations and fragmented user bases of those environments. Arch vs Gentoo vs Ubuntu vs Redhat Enterprise vs Fedora vs SUSE vs just forget it.
• Even after choosing the GNU/Linux distribution, there is still the big decision of which desktop environment to choose and fragmented user bases of those environments. Gnome vs KDE vs XFCE vs etc etc etc just forget it.

I am for sure going to get negative votes for the above opinion (and likely some might rail -- incorrectly -- about the financial costs of "free" vs "paid" unix)

2

Even though I disagree that OSX is the best POSIX system for a bioinformatics working environment, I've upvoted this for your guts to push the merits of OSX. While I do think OSX is a good laptop environment, I don't think it is the best bioinformatics environment for beginners or workstations/servers. This is because installation and use of many bioinformatics tools requires custom compilations or work-arounds that are ultimately just a waste of time. Take for example the need to install the developer tools just to get gcc working....

1

this is not slashdot. nobody receives downvotes because s/he does not recommend "GNU"/Linux or advocates M"$" or Apple.

1

Hi @Jonathan. Thanks for your thoughts! I don't think I would personally recommend OS/X to anybody getting introduced to bioinformatics. A very practical reason for this is that I won't force them to buy a Mac for a few courses. I'll suggest they use Ubuntu. It's free, easy to try without installing, easy to install, supports an incredibly long list of hardware and has so many interesting packages ready to install. Anybody wishing to explore the UNIX path further, including BSD or OS/X, can do that easily. I have seen too many Mac users around me who fight with their Macs to install software.

0
11.4 years ago
Guangchuang Yu ★ 2.6k

I have implemented a customized Linux for bioinformatics called LXtoo. You can find it here: http://bioinformatics.jnu.edu.cn/LXtoo/

1

If we're advertising bioinformatics orientated linux distros I'd also point you here: http://nebc.nerc.ac.uk/tools/bio-linux/bio-linux-6.0 !!