Forum: Good Habits for Bioinformatics Analysts or Scientists
Shicheng Guo8.5k wrote 4.7 years ago:

Hey colleagues,

Let's summarize some good habits for our research. Bad habits have hurt my projects badly, so here are the habits I recommend:

1, Record everything about a project in one systematic place, such as a wiki or Evernote, so that you can check it easily. You will never remember everything if your notes are scattered everywhere.

2, Save all the data that you used to make each figure, since a boxplot may later be changed to a violin plot, a heatmap, or a bee swarm plot. You never know which one your boss or a reviewer will prefer. If you don't save the data, you may need to rebuild it.

3, Always keep figures as PDF; JPEG, TIFF, and PNG are not what you need for publication.

4, Use Adobe Illustrator. Never, never, never use Photoshop.

5, Learn to use ggplot2; once you master it, preparing figures is faster than with base R plotting.

6, Build your own library/package of functions (Perl, R, Python). Package them and reuse them next time. Don't write them again and again.

7, Upload the code to GitHub or GitLab and share it with yourself and others.

8, Record all methods, ideas, processes, procedures, and pipelines in MediaWiki and share them with your lab mates.

9, Submit the fastq files to SRA/GEO, or the wig files to UCSC, so that we don't need to spend extra money after we complete the project.

10, Code or scripts written by non-professional staff/students can be horrible, and the majority will have some bugs. Be careful; asking colleagues for a code review is a good habit.

11, How to prepare your manuscript efficiently: link: the best habit to prepare manuscript

12, Time Management Strategies and Advice for Bioinformaticians: Link here

13, Build your own bioinformatics server and assemble all the platforms you need along with your own pipelines.

14, Use Arial as the font in figures, never use red-green combinations, never use a rainbow color scale, and use a font size of 8 pt.

15, Never, never let a script run for 12 hours (especially under PBS); split it into many pieces that each finish within 2 hours. Your boss will be in trouble if you hit bugs several times.

16, Try the Anaconda data science platform and assemble the tools you prefer into one uniform platform.

17, Fork the software you use frequently on GitHub and help make it more powerful.

18, Check positive and negative controls for each computational analysis, so that you find bugs at the beginning.

19, Maintain your blog; make an md5sum label for each of your own databases.
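
On point 15, a minimal Python sketch of splitting a long job into pieces and skipping the pieces that already finished. The names `split_into_chunks` and `run_in_chunks` and the `.done` marker files are made up for illustration and are not part of PBS or any scheduler; in practice each chunk would probably be submitted as its own job.

```python
from pathlib import Path


def split_into_chunks(items, chunk_size):
    """Return consecutive slices of `items`, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_in_chunks(items, chunk_size, process, done_dir):
    """Process each chunk once; a marker file records chunks already finished.

    Returns the indices of the chunks that were processed in this call.
    """
    done_dir = Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for index, chunk in enumerate(split_into_chunks(items, chunk_size)):
        marker = done_dir / f"chunk_{index:04d}.done"
        if marker.exists():
            continue  # finished in an earlier run, so skip it
        process(chunk)
        # Written only after `process` returned without raising.
        marker.write_text("\n".join(map(str, chunk)) + "\n")
        ran.append(index)
    return ran
```

If a run dies halfway, calling `run_in_chunks` again with the same inputs and chunk size only redoes the unfinished pieces. Changing the chunk size between runs would make the old markers meaningless, so clear `done_dir` in that case.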

More suggestions?

scientist habit forum analyst • 6.4k views
ADD COMMENTlink modified 3.9 years ago • written 4.7 years ago by Shicheng Guo8.5k

Curious why you dislike Photoshop? I do most of my figure creation in GIMP, so it isn't vector based like AI is. But I've never had any problems with it.

ADD REPLYlink written 4.7 years ago by Sinji3.0k

Photoshop is really only appropriate for editing images of gels, or things like that. For generating or editing other types of plots, which should be scalable and vector-based, Illustrator (or Inkscape, etc) is the right tool.

ADD REPLYlink written 4.7 years ago by Chris Miller21k

Yep. I generate almost all my plots in R, including very complex ones. But for final polishing, required figure dimensions, dpi, color profile - I also use Photoshop.

ADD REPLYlink written 4.7 years ago by Biomonika (Noolean)3.1k

Isn't Illustrator better for editing vector graphics like PDF? Just curious why Photoshop...

ADD REPLYlink written 4.7 years ago by fanli.gcb710

Photoshop and Illustrator are both $29.99 a month. Meanwhile, e.g. GIMP and ImageMagick are FOSS.

ADD REPLYlink written 4.7 years ago by 5heikki9.1k

and Inkscape as a direct Illustrator alternative!

ADD REPLYlink written 4.7 years ago by Daniel3.8k

If you're a student, it's 20 bucks for everything :)

I've been on that deal for what seems like most of my adult life :P As great as GIMP and ImageMagick are (and ImageMagick is particularly good with the command + extensions like montage), once you learn where everything is in Photoshop and Illustrator, there's really no competition. I mean, GIMP and IM are really good considering they're totally free - but I think you get what you pay for with Adobe's Creative Cloud. You even get cloud storage and some other perks (like your username/password dumped online every now and again...heh).

But the best thing about going Adobe is that there are online guides/tutorials for just about everything. I had a particularly tricky issue the other day involving intersecting two SVG heatmaps, which I could 'solve' in Illustrator in about 10 minutes thanks to a guide someone made in 2001 :P

ADD REPLYlink written 4.7 years ago by John12k
Devon Ryan98k (Freiburg, Germany) wrote 4.7 years ago:
  1. Use a literate programming approach, such as with R-markdown or Jupyter/ipython.
  2. Version control everything. That annotation you got from Ensembl? Yeah, you better write down which release, because the next one might produce different results.
  3. Remember backups? Yeah, make sure you have those.
  4. Clean up after yourself. Don't be the guy/gal that occupies an excessive amount of space on the "very-expensive-overly-priced-poorly-performing-storage-array" (TM).

BTW, I would skip your number 2. You need to save primary unprocessed files (e.g., compressed fastq files or BAM/CRAM files after using bamHash or similar) and anything that takes an absurd amount of time to reproduce. You also need to save annotations and anything else that will likely be different if you download it again. However, don't save the results of every step; that's just going to blow up your storage costs and make it impossible to find anything. This happens to be identical to the common practice on the wet-lab side, where freezer/-80 space is always shared and HIGHLY limited.

ADD COMMENTlink written 4.7 years ago by Devon Ryan98k

I think the OP means to save the data that was used to plot a final published figure

This is a good point - it is easy to forget to save this data or at least document it very clearly.

Usually in the heat of the moment as we are focused on the data analysis we end up with many data inputs all from the same original set but these may be filtered one way or another, and we are going back and forth between them. Two months later when the reviews come back it is not so easy to figure out which data was plotted where.

ADD REPLYlink written 4.7 years ago by Istvan Albert ♦♦ 86k

Sure, you need to document how each version of each figure is made. Ideally you just extend whatever analysis you have by creating a new file (with a new name, or with some version or date associated to it that's then represented in your documentation/code).
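
One way to do this, as a rough Python sketch: write the table behind each figure to its own file, with a version or date in the name, and refuse to overwrite an existing version. The function name `save_figure_data` and the `<figure>_<version>.csv` naming are my own assumptions here, not an established convention.

```python
import csv
from datetime import date
from pathlib import Path


def save_figure_data(rows, header, figure_name, out_dir, version=None):
    """Write the table behind one figure to <figure_name>_<version>.csv.

    `version` defaults to today's date (YYYY-MM-DD). An existing file is
    never overwritten, so every version of a figure keeps its own data.
    """
    version = version or date.today().isoformat()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{figure_name}_{version}.csv"
    if path.exists():
        raise FileExistsError(f"{path} exists; pick a new version label")
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

Something like `save_figure_data(rows, ["sample", "value"], "fig2a", "figure_data")` next to the plotting call means the boxplot can later become a violin plot without rebuilding anything upstream.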

ADD REPLYlink written 4.7 years ago by Devon Ryan98k

This happens to be identical to the common practice on the wet-lab side, where freezer/-80 space is always shared and HIGHLY limited.

Great analogy.

ADD REPLYlink written 4.5 years ago by Gjain5.5k
fanli.gcb710 (Los Angeles, CA) wrote 4.7 years ago:

In addition to Devon's answers above:

  1. Sanity check. If you filter a dataset, check that this actually happened! Especially with projects that are linked with scripting, it is easy for unnoticed errors and omissions to occur.
  2. Don't reinvent the wheel. Chances are whatever you want to do has been done before. Biostars, stackoverflow, and seqanswers are all great places to search first.
  3. Take a step back and look at the big picture. What hypotheses are we trying to (dis)prove? What conclusions can we draw from the data, and what potential impact could they have?
  4. In line with #3, it's really important to keep up with the scientific and technical literature. Bleeding edge approaches are great, and having a feel for where to apply them is great as well.
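
For the sanity check in #1, a small Python sketch of what "check that this actually happened" could look like when the filtering itself is done by some other script or tool. `check_filter` is a hypothetical helper, and it assumes the filter keeps the original record order.

```python
def check_filter(before, after, keep):
    """Verify that `after` is exactly the records of `before` that pass `keep`.

    Returns (n_before, n_after, n_removed); raises ValueError on any mismatch.
    """
    before, after = list(before), list(after)
    # Nothing that fails the criterion may survive the filter.
    leaked = [r for r in after if not keep(r)]
    if leaked:
        raise ValueError(
            f"{len(leaked)} records should have been removed, e.g. {leaked[0]!r}"
        )
    # Nothing that passes the criterion may be lost (order is compared too).
    expected = [r for r in before if keep(r)]
    if after != expected:
        raise ValueError(
            f"filtered output differs from expectation "
            f"(expected {len(expected)} records, found {len(after)})"
        )
    return len(before), len(after), len(before) - len(after)
```

Printing the returned counts after every filtering step is cheap, and a step that removed zero records (or all of them) is usually worth a second look.
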
ADD COMMENTlink written 4.7 years ago by fanli.gcb710
John12k wrote 4.7 years ago:

Things not to do:

  1. Make Quality Control plots but never really look at them.

  2. Make Quality Control plots and look at them for too long.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by John12k

Or maybe keep looking at them until you have the published version of them!

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by MAPK1.7k

John. Don't tell us it is a joke. I love this joke.

ADD REPLYlink written 4.7 years ago by Shicheng Guo8.5k

I think watching QC plots is overrated and a thing of the past. I just got a dataset with over 150 samples; due to technical replicates it comes as over 600 fastq files, each with a FastQC-generated PDF (thankfully). Should I open all 600 PDFs lying around on the server, by copying them to my computer, double-clicking, and viewing them, just to find that the read quality is maybe OK for most and that all of the samples fail the sequence composition by position filter, like all samples before? If I need only 30 seconds per file, I will be doing this for 5 hours straight. I could imagine it is worth spending at most 10 seconds per file, which I could do by putting all the PDFs in one folder and then using the macOS gallery view to look only at the first page of each PDF, opening a file only when I spot some problem. A tabular overview would be much better.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Michael Dondrup48k

This is a perfect application for MultiQC by Phil Ewels.

ADD REPLYlink written 4.7 years ago by GenoMax94k

AfterQC is another great QC tool for fastq.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by biomaster180

I think watching QC plots is overrated and a thing of the past.

Hi- I think this is a bit harsh. Rather, I would say QC tools should produce output in a form that is easy to tabulate, so that looking at hundreds of QC reports is not difficult. In your case the problem was that PDFs are pretty much the opposite of "easy to tabulate", but if you had the raw output of fastqc you could fairly easily parse the text file containing the QC metrics.
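
As a rough sketch of that parsing in Python: this assumes the per-sample `summary.txt` that FastQC writes alongside its report, with one line per module in the form status, module name, file name, separated by tabs. Please check that layout against your FastQC version; the function names are mine.

```python
def parse_summary(text):
    """Parse the text of one summary file into {module name: status}."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: STATUS <tab> module name <tab> file name
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def tabulate(summaries):
    """Turn {sample: {module: status}} into rows: a header, then one row per sample."""
    modules = sorted({m for s in summaries.values() for m in s})
    rows = [["sample"] + modules]
    for sample in sorted(summaries):
        rows.append([sample] + [summaries[sample].get(m, "NA") for m in modules])
    return rows


def failing_samples(summaries, module):
    """List the samples whose status for `module` is FAIL."""
    return sorted(s for s, st in summaries.items() if st.get(module) == "FAIL")
```

With 600 files you would read each summary into `parse_summary`, write the rows from `tabulate` to a TSV, and open only the reports that `failing_samples` points at.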

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by dariober11k
Khader Shameer18k (Manhattan, NY) wrote 4.7 years ago:
  1. Plan and manage the projects as modules: for example data clean up/QC, database management, analytics, predictive modeling, machine learning, statistical inference, data visualization and biological/clinical inference. This helps in the long run to plug and play, and to easily build, test and deploy analytic pipelines.
  2. Assess the task: think before you spend countless hours coding that slick function. Someone may have already made an open bio-* package for the bioinformatics task.
  3. Backup: Document, version control and back up everything (including the Linux/Unix command line, using history). Bioinformatics is an applied practice and often contributes to scientific inference and clinical impact, so reproducibility is incredibly important. Tracking data provenance can help with reproducibility.
  4. Collaborate: co-create, and code-review
  5. Design thinking: Spend time solving the task creatively; you have the choice to turn the bioinformatics task into a simple script or into a package that many others could use.
  6. Engineer, don't just code. Understanding the technical details and knowing how to scale the systems from one data set to 100 or 1000 data sets is key.
  7. Future-proof the infrastructure: code can crack and pipelines can break, so it is good to have a mechanism to maintain and support the bioinformatics infrastructure.
  8. Give back to the community: share code, analytics or a blog. This gives more visibility and helps take the tool/paper to a large user base.
  9. Happy bioinformatics: Enjoy.
ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Khader Shameer18k
Sean Davis26k (National Institutes of Health, Bethesda, MD) wrote 4.7 years ago:

Do as much of your work in the public eye as possible. Github and the like have changed the way that I think and work.

ADD COMMENTlink written 4.7 years ago by Sean Davis26k
biomaster180 (San Jose) wrote 4.7 years ago:


Talk less, code more!

ADD COMMENTlink written 4.7 years ago by biomaster180

That might work as a punchline, but a good bioinformatician discusses with collaborators, checks for existing tools and "codes" only as a last option.

ADD REPLYlink written 4.7 years ago by _r_am32k

OK but don't think less!

ADD REPLYlink written 4.7 years ago by Manu Prestat4.0k
TriS4.3k (United States, Buffalo) wrote 4.7 years ago:
  1. when you are coding, add comments so that when you go back to it 3 months from now you remember/understand what/why you did it
  2. I use Google Slides to summarize my analysis, add thoughts and plots so that it's well organized and I can access them quickly, comment, go back to bed :)
  3. when using R, save the workspace so that you don't have to re-run the whole code when you need to go back
  4. when possible use more than one approach to analyze the data; if the result is consistent, great, if not, work out why
  5. can I mention again backups on server(s)?
ADD COMMENTlink written 4.7 years ago by TriS4.3k

I think your No. 3 is great. I will use it later.

ADD REPLYlink written 4.7 years ago by Shicheng Guo8.5k

I always have an .RData and an .Rhistory file saved, with the workspace being the project directory. The directory itself is part of a project hierarchy, so that structures everything.

Initialize R projects with setwd(), close R sessions with save.image() and savehistory(), reopen them with load() and loadhistory() - that's my routine for every project.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by _r_am32k

Oh! On point #3 - didn't know this could be that handy. I always select "No" on any exit prompts. :-/ Thank you very much. Will try using it.

And #2 is definitely helpful. I create flow diagrams in PowerPoint to explain pipelines etc. to my supervisor/PI.

ADD REPLYlink written 4.7 years ago by Bioinformatics_NewComer320

I am going to chime in with a disagreement on point 3. In my mind that feature is like the dark side in Star Wars or, as Yoda would say

Easily they flow, quick to join you when code you write. If once you start down the dark path, forever will it dominate your destiny. Consume you it will.

Basically, reproducible analysis describes our ability to quickly reproduce a result - BUT that needs to happen from the raw data, not from some intermediate state that we don't quite remember how we got to.

Don't get me wrong, .RData and .Rhistory are very useful, but as with many good things one needs to use these in moderation and understand their pitfalls.

ADD REPLYlink written 4.7 years ago by Istvan Albert ♦♦ 86k
Ryan Dale4.9k (Bethesda, MD) wrote 4.7 years ago:

If using data from other sources, keep track of where it came from. This can be as easy as a shell script with a bunch of wget or curl lines, but such a small thing can make a big difference in a few months when you forget where you got those files.

ADD COMMENTlink written 4.7 years ago by Ryan Dale4.9k

I also store the md5sum and other details like "date downloaded" and "# sequences included" about the file - public datasets like uniprot_sprot.fasta keep on changing, and it's easier to compare with collaborators if you have md5sums
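
A minimal Python sketch of such a record, using only the standard library. The tab-separated layout (file name, source URL, date downloaded, number of sequences, md5) and the function names are just one possible choice, not a standard format.

```python
import hashlib
from datetime import date
from pathlib import Path


def md5sum(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, read in blocks to spare memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def count_fasta_records(path):
    """Count header lines (those starting with '>') in an uncompressed FASTA file."""
    with open(path, "rb") as handle:
        return sum(1 for line in handle if line.startswith(b">"))


def manifest_line(path, source_url, downloaded=None):
    """Build one tab-separated provenance record for a downloaded file."""
    downloaded = downloaded or date.today().isoformat()
    fields = [
        Path(path).name,
        source_url,
        downloaded,
        str(count_fasta_records(path)),
        md5sum(path),
    ]
    return "\t".join(fields)
```

Appending one such line per download to a text file in the project directory is enough to tell two copies of "the same" database apart later.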

ADD REPLYlink written 4.7 years ago by Philipp Bayer6.9k
chen2.1k wrote 4.7 years ago:

Here are some of my tips:
1, sharing: turn your code into libraries and share them on GitHub
2, visualization: always visualize your data
3, noise: keep in mind that data always contains noise; filter and clean it before use
4, git: use git to track all your code, manuscripts and slides
5, toolchain: maintain the tools you usually use as a toolchain
6, testing: always test your pipelines/algorithms/tools with benchmark data

ADD COMMENTlink modified 4.7 years ago by Sean Davis26k • written 4.7 years ago by chen2.1k
Israel Barrantes790 wrote 4.3 years ago:

Keep a command line history log for every program you installed/compiled, including version numbers of its dependencies. This will be really helpful not only for reproducibility, but also in case of moving up to new servers.
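
The version-number part can be automated. Here is a hedged Python sketch that runs each tool's version command and appends the first line of output to a log. `tool_version` and `log_versions` are hypothetical names, and the exact version flag differs between tools (`--version`, `-v`, `version`, ...), so the commands themselves are up to you.

```python
import subprocess
from datetime import datetime


def tool_version(command):
    """Run a version command and return the first line it prints."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # Some tools print their version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def log_versions(commands, log_path):
    """Append one timestamped, tab-separated line per tool to a plain-text log."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        for name, command in commands.items():
            log.write(f"{stamp}\t{name}\t{tool_version(command)}\n")
```

For example, `log_versions({"python": ["python3", "--version"]}, "versions.log")`. Because of `check=True`, a tool that exits with an error on its version command raises an exception instead of being logged silently.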

ADD COMMENTlink written 4.3 years ago by Israel Barrantes790
Asaf8.5k wrote 4.7 years ago:

Make your projects reproducible.

ADD COMMENTlink written 4.7 years ago by Asaf8.5k
Vivek2.4k wrote 4.7 years ago:

Slightly unusual, but tarball and archive all your work directories after you are done with a large project to save money on storage. Tape archives, especially the ones maintained by large genome centers, are quite cheap, and data can be retrieved in a day or two. On the other hand, you get billed heavily for data storage on file systems.
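
A small Python sketch of the tarball step, with a check that the archive lists everything in the directory before anyone deletes the original. `archive_directory` is an illustrative name; the sketch assumes the archive is written outside the directory being packed, and it does not touch the original files.

```python
import tarfile
from pathlib import Path


def archive_directory(work_dir, archive_path):
    """Pack `work_dir` into a gzip-compressed tarball and verify its listing.

    Returns the sorted member names; raises RuntimeError if the archive
    does not list exactly the files and folders found under `work_dir`.
    """
    work_dir = Path(work_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(work_dir, arcname=work_dir.name)
    # Paths expected in the archive, relative to the parent of work_dir.
    expected = {work_dir.name} | {
        p.relative_to(work_dir.parent).as_posix() for p in work_dir.rglob("*")
    }
    with tarfile.open(archive_path, "r:gz") as tar:
        archived = set(tar.getnames())
    if archived != expected:
        raise RuntimeError("archive listing does not match the directory")
    return sorted(archived)
```

Comparing names only proves that nothing was skipped, not that the contents are intact, so storing a checksum of the tarball next to it is a sensible extra step before moving it to tape.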

ADD COMMENTlink written 4.7 years ago by Vivek2.4k