Hi All!
I am a newbie in bioinformatics and I want to learn about it.
One question I have: do bioinformaticians really need to learn programming languages, and if so, why? On the one hand, I see a lot of posts saying that there is plenty of software out there that lets you perform bioinformatics analyses and that "reinventing the wheel" is dangerous. On the other hand, I see many people saying that knowing programming languages is important for bioinformaticians.
If there are already many programs that do the bioinformatics analysis, why does a bioinformatician need to learn a programming language to develop a program? [I assume the reason for learning programming is to develop software.]
Thank you!
You can consider the software that is out there as your basic toolbox: hammer, screwdriver, monkey wrench, saw, whatever. You apply those tools to your raw data, which gives you more (processed) data. E.g., running bwa on your reads and a reference turns your .fastq raw reads into a .sam file of aligned reads. Now, what do you do with those aligned reads? If you're doing something simple, you might load them into a genome viewer, and maybe that is already enough to see what you're looking for.
When dealing with huge amounts of data in a research environment, you will end up with a pile of processed data that you need to sift through. Do you want to do that manually? Inspect each and every pair of your millions of aligned reads individually? Or your blast hits? Do you want to always run every tool by hand? Those are some of the situations where it comes in handy to know a little bit of scripting. The idea is that you extend the existing set of tools with specialised scripts or programs in order to solve the problem you're working on. The idea is not that you have to develop new, generally applicable tools, unless you're interested in doing so (or it is part of your job).
Since you have tagged this question with "R": R packages provide solutions for specific problems (e.g. differential gene expression), but you still need to write a couple of lines of code to use those solutions, which is also a kind of coding.
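To make the "sifting through processed data" point concrete, here is a minimal Python sketch of the kind of small script meant above. It counts confidently aligned reads per reference sequence in SAM-formatted text. The function name, the MAPQ threshold of 30, and the toy records are my own illustrative assumptions, not part of any tool; for real work a library such as pysam would be the more robust choice.

```python
import io
from collections import Counter


def count_confident_alignments(sam_lines, min_mapq=30):
    """Count mapped reads per reference with MAPQ >= min_mapq.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])             # column 2: bitwise FLAG
        rname = fields[2]                 # column 3: reference name
        mapq = int(fields[4])             # column 5: mapping quality
        if flag & 0x4:                    # 0x4 means the read is unmapped
            continue
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts


# Toy SAM content (made up) standing in for a real alignment file.
toy_sam = io.StringIO(
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read2\t0\tchr1\t50\t5\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "read4\t16\tchr2\t7\t42\t4M\t*\t0\t0\tACGT\tIIII\n"
)
confident = count_confident_alignments(toy_sam)
for reference, n in sorted(confident.items()):
    print(reference, n)
```

With a real file you would pass an open file handle instead of the toy `io.StringIO` object; the loop itself stays the same.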
In general, you should know a scripting language (Python, Perl, Ruby being the most common), bash scripting and bash command line tools, in addition to the typical bioinformatics tools (although tools can always be learned on the job, in my opinion). There is no need to learn languages such as C/C++, Java, etc. unless you want to look into developing efficient bioinformatics tools (and then it is not just the languages: you also need to learn algorithms and data structures). For data analysis alone, these latter languages/concepts/skills should not be required.
Thank you :)
I just need two clarifications:
If I only want to do bioinformatics analysis, I just need to know enough programming to run the software and use it to analyse the data. Correct?
Under what circumstances does one need to develop software, an algorithm, or new tools? For some jobs, the description says that the successful candidate must know programming languages in order to develop tools and software.
Thank you :)
For analysis you just need to know enough to a) run the various analysis tools efficiently, by building pipelines, e.g. via bash scripting (assuming you're not using a platform like Galaxy), and b) "filter" and format your results into something usable.
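As an illustration of point b), here is a minimal Python sketch that reduces BLAST tabular output (the standard 12-column `-outfmt 6` layout) to the best hit per query. The function name, the E-value cut-off of 1e-5, and the toy hits are assumptions chosen for the example, not recommendations.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} keeping the top bitscore.

    Hypothetical helper; the E-value threshold is an assumption.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):   # skip blanks and comments
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:                 # discard weak hits
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], evalue, bitscore)
    return best


# Toy BLAST results (made up) standing in for a real output file.
toy_blast = io.StringIO(
    "geneA\tsubj1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t380\n"
    "geneA\tsubj2\t80.0\t150\t30\t0\t1\t150\t10\t159\t1e-20\t120\n"
    "geneB\tsubj3\t60.0\t50\t20\t0\t1\t50\t1\t50\t0.5\t30\n"
)
filtered = best_hits(toy_blast)
for query, (subject, evalue, bitscore) in sorted(filtered.items()):
    print(query, subject, evalue, bitscore, sep="\t")
```

Here `geneB` drops out because its only hit is above the E-value cut-off, and `geneA` keeps only its higher-scoring hit.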
Unless you're in an algorithm-heavy lab (usually in a Computer Science or Math setting), you will probably not develop the next super-fast, super-accurate short-read aligner or variant caller. But remember, algorithms are just recipes, or lists of steps to perform in order to achieve something (e.g., you sort a list by performing certain steps in a structured fashion), so even by building a simple pipeline you will be developing one (mind you, it might not be the most efficient or the fastest, but it counts nevertheless).
New software usually gets developed when there is no (easy-to-use, easily accessible) solution available for a certain problem. For instance, I am currently developing a pipeline for assessing a specific gene family in plants, using specific exome captures and PacBio sequencing. I am tying various tools together with a Python script and developing methods in Python that do not yet exist for my specific problem. Ultimately, I will end up with a new software tool, even if it is just another pipeline with custom-built components under the hood. Hope that makes sense.
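To show what "tying tools together with a Python script" can look like in its simplest form, here is a hedged sketch (not the pipeline described above). It chains `bwa mem`, `samtools sort` and `samtools index`; the file names and helper functions are hypothetical, and it assumes both tools are installed and that the reference has already been indexed with `bwa index`. By default it only prints the commands (a dry run) rather than executing them.

```python
import shlex
import subprocess


def build_steps(reference, reads, prefix):
    """Return a list of (argv, stdout_path) steps; paths are illustrative."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        (["bwa", "mem", reference, reads], sam),       # alignment goes to stdout
        (["samtools", "sort", "-o", bam, sam], None),  # coordinate-sort into BAM
        (["samtools", "index", bam], None),            # index the sorted BAM
    ]


def run_pipeline(steps, dry_run=True):
    """Run each step in order and return the shell-style command log."""
    log = []
    for argv, stdout_path in steps:
        shown = " ".join(shlex.quote(arg) for arg in argv)
        if stdout_path:
            shown += " > " + shlex.quote(stdout_path)
        log.append(shown)
        if dry_run:                # only record the command, do not execute
            continue
        if stdout_path:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)
    return log


commands = run_pipeline(build_steps("ref.fa", "reads.fastq", "sample1"))
for command in commands:
    print(command)
```

Passing `check=True` makes the script stop at the first failing tool instead of silently continuing with incomplete files, which is the main thing a hand-run sequence of commands tends to get wrong.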
The main reason you need to learn a programming language for bioinformatics, I'd say, is that any given work flow is an exercise in connecting the dots. You feed files from program A -> B -> C and so on. However, much of the file conversion needed to go between programs is most easily accomplished if you're handy with Bio(perl|python), bash, awk, and so on.
Much of the 'heavy lifting' of an analysis workflow might be handled by a self-contained program that you don't need to interact with very extensively, but there will always be intermediate bits that you have to get your hands dirty with.
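A typical "intermediate bit" is converting one file format into another so the next program accepts it. Below is a minimal plain-Python sketch that converts FASTQ to FASTA. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequences), and the function name is my own; Biopython's `SeqIO` handles this more robustly.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write FASTA records for each 4-line FASTQ record; return the count.

    Assumes unwrapped, 4-line FASTQ records (an assumption of this sketch).
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:                    # end of file
            break
        if not header.strip():            # tolerate blank lines between records
            continue
        sequence = fastq_handle.readline().rstrip("\n")
        plus = fastq_handle.readline()
        fastq_handle.readline()           # quality line, not needed for FASTA
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence + "\n")
        count += 1
    return count


# Toy FASTQ content (made up) standing in for a real reads file.
toy_fastq = io.StringIO(
    "@read1 first read\nACGT\n+\nIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
)
fasta_out = io.StringIO()
n_records = fastq_to_fasta(toy_fastq, fasta_out)
print(n_records)
print(fasta_out.getvalue(), end="")
```

With real files you would open the input and output paths with `open(...)` and pass those handles in place of the `io.StringIO` objects.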
You could have added this supplementary question to the original post rather than posting it 3 times.