Question

Forum:Things you wish you knew when entering the bioinformatics field

27

4.8 years ago

jared.andrews07 ★ 19k

As the title states, what are things that you wish you knew when you first started doing bioinformatics work? This can range from pitfalls that you later learned to avoid, tips and tricks that make life as a bioinformaticist easier, resources that have proved especially helpful, particular philosophies/modes of thinking that have been beneficial, etc. This can range from analysis to software development work.

A few things I can remember off the top of my head:

Resources like Biostars and StackExchange sites that many people don't know exist.
The benefits and proper use of version control (git/Github).
Environment management (conda, docker, etc).
Attempting to reinvent the wheel rather than modifying/utilizing existing tools & ecosystems (Biopython, Bioconductor, etc).
The entire concept of scope management, and by association, feature creep.
Analysis paralysis and the recognition that no analysis is perfect.
The value of exploratory analyses and proper QC.
- "Wading" through the data is not a waste of time.
Recognizing that no analysis can supplant proper experimental design.
Analysis documentation is as critical as experimental documentation.
Allot at least twice the amount of time for a given analysis as you expect you'll need.
- Setting realistic time tables in general.
Some analyses just will not work out, just as some experiments do not. These are often not personal failures, just the nature of science.
GIGO. Garbage in, garbage out.
- An analysis on garbage data will result in garbage results/conclusions. "Do what you can" is what you hear before sound science dies.
If you can't interpret the output of an analysis, you likely aren't prepared to perform it properly.

A few other things after a bit more thought:

Data munging is critically important, time-consuming, and ultimately boring. Sticking to pre-defined formats wherever possible makes this much less annoying and improves interoperability.
For the love of the science gods, please do not create your own custom file format - there is almost certainly something out there that fits your purpose.
Don't use a complicated 3000-watt Heavy-Duty Demolition Concrete Jackhammer when a simple hammer will suffice.
Where possible, keep things simple. Performance takes a backseat to usability and clarity, particularly for scripting purposes.
Learn to use tests if doing any development.
Patience. You will not become a master overnight, just as you weren't performing complex experiments before you had even learned to pipette.
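
To make the "learn to use tests" point concrete, here is a minimal sketch in Python. The function `reverse_complement` and its test are hypothetical examples written for this thread, not taken from any library. The test uses plain `assert` statements, so it can be run directly or picked up by a test runner such as pytest.

```python
# Hypothetical helper used only to illustrate testing.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    unexpected = set(seq) - set("ACGTNacgtn")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return seq.translate(_COMPLEMENT)[::-1]


def test_reverse_complement() -> None:
    # Known answer, worked out by hand.
    assert reverse_complement("ATGC") == "GCAT"
    # Applying the function twice must give back the input.
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"
    # Edge case: empty input.
    assert reverse_complement("") == ""
    # Invalid input should fail loudly rather than pass through silently.
    try:
        reverse_complement("ATXG")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for invalid base")


if __name__ == "__main__":
    test_reverse_complement()
    print("all tests passed")
```

Even a handful of checks like these (a known answer, a round trip, an edge case, a bad input) tends to catch the kind of silent bug that otherwise surfaces months later.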

I welcome any and all other thoughts and contributions.

pitfalls advice • 5.6k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 4.8 years ago by jared.andrews07 ★ 19k

2

Related: Advice For Newcomers To The Bioinformatics Field

ADD REPLY • link 4.8 years ago by Medhat 9.8k

0

Ah, I searched and somehow didn't stumble upon that. Thanks! An updated revisit may be useful regardless.

ADD REPLY • link 4.8 years ago by jared.andrews07 ★ 19k

0

Yeah, it is very old and definitely needs updates.

ADD REPLY • link 4.8 years ago by Medhat 9.8k

2

There is an old post with some good advice (it is more about how to manage a bioinformatics position or career, and less about technical stuff): A guide for the lonely bioinformatician.

For a core bioinformatician, in addition to the technical-side suggestions above, here are some "mind-frame" tips:

don't get upset by lack of acknowledgement or authorship
do your best to analyse bad data / bad experimental designs, but
stand your ground when asked to perform incorrect analyses
never say "I told you so" after being ignored (not because people don't deserve it, but because they take it badly)
keep improving your skills, don't get too comfortable doing the same thing over and over

As repeatedly stated, communication skills are very important, but bioinformaticians are often very poor at communicating. Keep working on it, both to better convey technical aspects of the work and to improve your interpersonal relations.

ADD REPLY • link 4.8 years ago by h.mon 35k

score 8 · Accepted Answer · 2020-09-21

Don't panic - just because a tutorial/program/script seems like gibberish or way too complicated, does not mean you won't be able to understand, apply or even develop something yourself. Just like you would with any new lab method, tackle this systematically and by seeking out expert help. The learning curve may be steep in the beginning, but it probably was so in the lab, too.
Do your homework, i.e. read up on basic concepts, perhaps even an introduction to computing, such as this one by David Evans. Definitely read everything by Mike Love and Rafa Irizarry that you can get your hands on, e.g. their excellent Introduction to Data Science.
Use the command line (and whatever other programming language you will stick to) as much as possible in your day-to-day tasks to get used to it. Maybe work through the Korf Lab's primer for Linux.
Never ever use a white space in a file name.
Start small.
Read other people's scripts and source code. Github is fantastic for this; many journals also require that authors submit their entire code for all analyses presented in a paper -- choose a paper that is relevant to your field and really try to understand every single line of code. You will learn so much, e.g. common practices and solutions, and that code is usually written by other people (and not by God-like entities).
Follow Pierre Lindenbaum's advice.

score 7 · Accepted Answer · 2020-09-21

7

4.8 years ago

JC 13k

I would just add "communication skills". I often need to explain bioinformatics concepts to developers or biologists, and each group needs things explained in a different way.

ADD COMMENT • link 4.8 years ago by JC 13k

score 6 · Accepted Answer · 2020-09-21

Doesn't quite fit the bill of "things I wish I knew before" but,

A general-purpose scripting language you are comfortable tackling most problems in.

It's great to know a bit of bash and awk and R etc., but it's worth getting really comfortable with a language you can confidently do (almost) anything in. For me that's python followed by bash, but I could in principle do basically everything in python I guess.

score 6 · Accepted Answer · 2020-09-22

In no particular order, and with understanding that I am repeating some parts of other people's advice:

If you don't already have some biology background, try to learn it. This goes both ways. Bioinformaticians can do their job without knowing biology, just like biologists can do theirs without bioinformatics. Still, it is easier to appreciate whether a hypothesis generated by bioinformatics is actually testable if one has some wet-lab background. In my experience, people who are knowledgeable in both fields tend to come up with better lab and console experiments, and better interpretations of both types of results.
Learn the meaning of statistical significance and biological plausibility of a given hypothesis. Modern web servers allow us to generate a bioinformatics hypothesis in half an hour of playing with the data. It is important to know which of these hypotheses are worth exploring further, both by computer and in the lab. The examples are too numerous to pick just one, so I will link this paper instead.
With few exceptions, almost everything that you will need as a bioinformatician has most likely been done by someone else, and likely better than most of us can do. No matter how much we may think that our projects and our problems are unique, that is rarely the case. It is easier to spend a day or three studying literature and GitHub, or searching even obscure Google discussion groups to find a solution. I often look at tens of programs that I wrote 15 years ago that are useless to anyone but me. For almost all of them there is an equivalent that is better documented, faster, or both. It counts for very little that I have written programs in Basic, Fortran, (Turbo)Pascal and C-64 native assembler when hardly anyone can compile them these days. And trust me, I am not beating away suitors interested in reading through that code.
I do not advocate learning only one programming language, as different problems require different tools. Still, I think there should be one programming language - at most two - where we are most comfortable and can solve almost any problem. I am much happier since I took up Python about 5 years ago. Up until that point, I was solving problems through a combination of the languages mentioned above, plus C and Perl, and that often led to solving a problem in one language and then redoing it in another a couple of years later because I had commented the code poorly. At least now others can read and use my programs.
To repeat some sage advice from above: never use a white space in file names. Or (semi)colons. Remember that slash is for general things, and backslash for "special" things. Remember that a computer doesn't know anything about special locations where you like to keep your programs - you must add them to the $PATH. Remember your frustration when a program goes for hours without printing any progress messages, and add those informative messages to your own programs.
Consider that your greatest claim to fame may not be your biggest scientific discovery, but actually creating something that other people find useful. I often find that software used by most people is not necessarily the best by objective metrics. Instead, it is free and readily available, or open-sourced, or well-documented with lots of real-life examples, or frequently updated. Sometimes all of the above.
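
The file-name and progress-message advice above can be sketched in a few lines of Python. The names `risky_characters` and `process_records` are made up for illustration, and the characters flagged are only the ones mentioned in this answer (whitespace, colons, semicolons), so treat this as a sketch rather than a complete portability check.

```python
import logging
import re
from typing import Iterable, List

# Characters this answer warns about: whitespace, colons and semicolons.
_RISKY = re.compile(r"[\s:;]")


def risky_characters(filename: str) -> List[str]:
    """Return the sorted, distinct risky characters found in a file name."""
    return sorted(set(_RISKY.findall(filename)))


def process_records(records: Iterable[str], report_every: int = 1000) -> int:
    """Hypothetical long-running loop that logs progress as it goes."""
    log = logging.getLogger("progress")
    count = 0
    for count, _record in enumerate(records, start=1):
        # ... real work on _record would go here ...
        if count % report_every == 0:
            log.info("processed %d records", count)
    log.info("finished: %d records in total", count)
    return count


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    for name in ["sample_01.fastq.gz", "my sample;final.txt"]:
        bad = risky_characters(name)
        print(name, "->", "ok" if not bad else f"avoid {bad}")
    process_records(f"read{i}" for i in range(2500))
```

Using the standard `logging` module instead of bare `print` calls gives timestamps for free and lets users silence or redirect the messages without touching the code.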

score 6 · Accepted Answer · 2020-09-24

As a bioinformatician you have a (currently) very valuable skill set. People will seek you out, hoping that you'll do their analysis for them. I have been burned by this because people don't know what you can and cannot do, so they come to you with insufficient data, with a vaguely defined biological question or no question at all ('we have this Illumina-only genome assembly from 2012 we've always wanted to publish, but....'), expect you to do wonders ('Bioinformatics gets Nature papers, right? just do your computer magic!'), and so on.

I wish I had known how to recognise time-wasting projects so I could just say no; I've wasted too much time on them. There are some questions you can ask, though!

What's your biological question?
What's the story of the paper? Can I see a draft?
What's your budget?
What's the timeline?
What do you think my deliverables are?
Do you have funding to co-fund a PhD student/Postdoc who will do this analysis?

and surely many more. Sometimes projects with people who cannot answer these questions satisfactorily will still lead to great papers and collaborations. But IMHO it's not the norm.

score 4 · Accepted Answer · 2020-09-22

4

4.8 years ago

Lluís R. ★ 1.2k

One thing that is missing: check and double-check the data. If you make any assumptions, check them!

Order of files: check it, as files might be ordered alphabetically or numerically.
Database entries: check there aren't any mistakes/typos, you don't want to discover this 1 year later or more.
Number of samples: check that any filter/quality process didn't remove a sample, or that it removed only the right ones.
Names of samples: A missing dash, 0 or a slightly different character can select a different sample
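
A minimal sketch of how some of these checks could be automated in Python. The function names and the sample names are hypothetical. The point is only to show numeric-aware ("natural") ordering versus plain alphabetical ordering, and a before/after comparison of sample names.

```python
import re
from typing import Dict, Iterable, List, Set


def natural_key(name: str) -> List[object]:
    """Sort key that compares runs of digits as numbers, not as text."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd-indexed part is a run of digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def compare_samples(before: Iterable[str], after: Iterable[str]) -> Dict[str, Set[str]]:
    """Report samples lost or unexpectedly gained between two processing steps."""
    before_set, after_set = set(before), set(after)
    return {"lost": before_set - after_set, "gained": after_set - before_set}


if __name__ == "__main__":
    files = ["sample10.bam", "sample2.bam", "sample1.bam"]
    print(sorted(files))                   # alphabetical: sample1, sample10, sample2
    print(sorted(files, key=natural_key))  # numeric-aware: sample1, sample2, sample10

    expected = ["S-01", "S-02", "S-03"]
    observed = ["S-01", "S02", "S-03"]     # a missing dash creates a "new" sample
    print(compare_samples(expected, observed))
```

Running a comparison like this after every filtering or merging step makes a dropped or misnamed sample visible immediately instead of a year later.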

ADD COMMENT • link 4.8 years ago by Lluís R. ★ 1.2k

2

Absolutely true. The importance of never blindly trusting that your script/command did what you expected it to do cannot be overstated.

Related to this: Never, ever believe a result that is either too convenient, too clear-cut or too confusing/unexpected. These should be immediate red flags for you to go over your analysis until you find the bug. There is always a bug. It is not a matter of "if", but of whether it impacted your conclusions.

ADD REPLY • link 4.8 years ago by Friederike 9.0k

score 4 · Accepted Answer · 2020-09-22

For the wet/dry lab hybrids (or in general how things imho should go):

Make sure that you validate your main computational findings with experiments. It is tempting to waste a lot of time confirming results in silico, trying different tools, pretty plots, everything wrapped into a nice Markdown, one-click reproducibility etc. Still, eventually this is all meaningless if you cannot show with credible experiments that your in silico predictions actually hold true. Say you find gene signatures suggesting that certain pathways drive your phenotype. If a knockout experiment of your main candidate (say a transcription factor that is on top of the hierarchy) does not create any phenotype, well, then your nice and super reproducible Markdown is nice to have, but worthless in terms of actually explaining biology.

On the other hand, if your experiments confirm your data, but your code is messy, spread out over a dozen scripts on two workstations and you even lost the scripts for the low-level processing, well, it is going to be hard to actually reproduce your analysis.

Take-home message: Learn how to organize your code. Comment as much as possible in your scripts. Workflow managers can help, but are not a must (imho) if your scripts are proper, maybe with Git-based version control, and backed up somewhere. Run your analysis to create hypotheses, then be sure to validate them. Be sure to have a robust finding first, you can then still continue to make all kinds of nice support figures, but you need a solid biological finding that is backed up with strong experimental evidence (beyond the experiments that created the input for the in silico analysis). Pretty figures alone do not make a story unless you are a large consortium that can compensate for the lack of biological findings by providing a huge amount of data that the community might use in the future as reference datasets.