Forum: Good Habit for Bioinformatics Analyst or Scientist
Shicheng Guo wrote, 11 months ago:

Hey colleagues,

Let me summarize some good habits for our research. I have been badly burned on projects by bad habits, so here is my list:

1, Record everything in a project on one systematic page, such as a wiki or Evernote, so that you can check it easily. Never try to remember everything if it is scattered everywhere.

2, Save all the data you used to make each figure, since sometimes a boxplot will be changed to a violin plot, heatmap, or bee swarm plot. You never know which one your boss or a reviewer will prefer, and if you don't save the data, you may need to rebuild it from scratch.

3, Keep figures as PDF forever; you know JPEG, TIFF, and PNG are not what you need for publication.

4, Use Adobe Illustrator. Never, never, never use Photoshop.

5, Learn to use ggplot2; preparing figures is much faster once you master it, compared with base R plotting.

6, Build your own function libraries/packages (Perl, R, Python). Compile them and reuse them next time; don't write the same code again and again.

7, Upload your code to GitHub or GitLab, and share it with yourself and others.

8, Record all methods, ideas, processes, procedures, and pipelines in MediaWiki and share them with your lab-mates.

9, Save FASTQ files to SRA/GEO, or wig files to UCSC, so that we don't need to spend extra money on storage after completing the project.

10, Code or scripts written by non-professional staff/students can be horrible; most will have bugs. Be careful: asking colleagues for a code review is a good habit.

11, How to prepare your manuscript efficiently: link: the best habit to prepare manuscript

12, Time Management Strategies and Advice for Bioinformaticians: Link here

13, Build your own bioinformatics server and assemble all the platforms you need, plus your own pipelines.

14, Use Arial as the figure font, never use red-green combinations, never use a rainbow color scale, and keep the font size at 8 pt.

15, Never let a script run for 12 hours (especially under PBS); split it into many pieces that each finish within 2 hours. Your boss will be in trouble if you hit bugs several times.
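The splitting in point 15 can be sketched in shell. This is only a sketch: the file names, chunk size, and the `process_chunk.pbs` job script are hypothetical, and the `qsub` command assumes a PBS-style cluster, so it is only echoed here.

```shell
# split a big input into pieces small enough to finish well under 2 hours
# (file names, chunk size, and the PBS job script are hypothetical)
seq 1000 > input.bed                      # stand-in for a real input file
split -l 100 --numeric-suffixes=1 input.bed chunk_

# submit one short job per chunk instead of one 12-hour monster;
# only print the command here, in case no scheduler is installed
for f in chunk_*; do
    echo qsub -l walltime=02:00:00 -v INPUT="$f" process_chunk.pbs
done
```

Each chunk then becomes an independent, restartable job, so one bug costs you two hours of compute, not twelve.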

16, Try to use the Anaconda data science platform and assemble the tools you prefer into a uniform platform.

17, Fork the software you use frequently on GitHub and help make it more powerful.

18, Check positive and negative controls for each computational analysis, so that you find bugs at the beginning.

19, Maintain your blog, and make an md5sum label for each of your own databases.
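The md5sum labelling in point 19 can be as simple as a checksum manifest per database directory; the paths and stand-in data below are hypothetical:

```shell
# create a checksum manifest for a database directory (paths hypothetical)
mkdir -p mydb && echo ">seq1" > mydb/ref.fa       # stand-in data
md5sum mydb/*.fa > mydb.md5

# later, verify that nothing has silently changed; exits non-zero on mismatch
md5sum -c mydb.md5
```

Keep the `.md5` file next to the database; `md5sum -c` then tells you instantly whether your copy matches the labelled version.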

More suggestions?

Tags: scientist • habit • forum • analyst • 3.2k views
modified 10 weeks ago • written 11 months ago by Shicheng Guo

Curious why you dislike Photoshop? I do most of my figure creation in GIMP, which isn't vector-based like AI, but I've never had any problems with it.

written 11 months ago by Sinji

Photoshop is really only appropriate for editing images of gels, or things like that. For generating or editing other types of plots, which should be scalable and vector-based, Illustrator (or Inkscape, etc) is the right tool.

written 11 months ago by Chris Miller

Yep. I generate almost all my plots in R, including very complex ones. But for final polishing, required figure dimensions, dpi, color profile - I also use Photoshop.

written 11 months ago by Biomonika (Noolean)

Isn't Illustrator better for editing vector graphics like PDF? Just curious why Photoshop...

written 11 months ago by fanli.gcb

Photoshop and Illustrator are both $29.99 a month. Meanwhile, e.g. GIMP and ImageMagick are FOSS.

written 11 months ago by 5heikki

and Inkscape as a direct Illustrator alternative!

written 11 months ago by Daniel

If you're a student, it's 20 bucks for everything :)

I've been on that deal for what seems like most of my adult life :P As great as GIMP and ImageMagick are (and ImageMagick is particularly good on the command line, with extensions like montage), once you learn where everything is in Photoshop and Illustrator, there's really no competition. I mean, GIMP and IM are really good considering they're totally free, but I think you get what you pay for with Adobe's Creative Cloud. You even get cloud storage and some other perks (like your username/password dumped online every now and again...heh).

But the best thing about going Adobe is that there are online guides/tutorials for just about everything. I had a particularly tricky issue the other day involving intersecting two SVG heatmaps, which I could 'solve' in Illustrator in about 10 minutes thanks to a guide someone made in 2001 :P

written 11 months ago by John
Devon Ryan (Freiburg, Germany) wrote, 11 months ago:
  1. Use a literate programming approach, such as with R-markdown or Jupyter/ipython.
  2. Version control everything. That annotation you got from Ensembl? Yeah, you better write down which release, because the next one might produce different results.
  3. Remember backups? Yeah, make sure you have those.
  4. Clean up after yourself. Don't be the guy/gal that occupies an excessive amount of space on the "very-expensive-overly-priced-poorly-performing-storage-array" (TM).

BTW, I would skip your number 2. You need to save primary unprocessed files (e.g., compressed fastq files, or BAM/CRAM files after using bamHash or similar) and anything that takes an absurd amount of time to reproduce. You also need to save annotations and anything else that will likely be different if you download it again. However, don't save the results of every step; that will just blow up your storage costs and make it impossible to find anything. This happens to be identical to the common practice on the wet-lab side, where freezer/-80 space is always shared and HIGHLY limited.

written 11 months ago by Devon Ryan

I think the OP means to save the data that was used to plot a final published figure.

This is a good point - it is easy to forget to save this data, or at least to document it very clearly.

Usually, in the heat of the moment, as we are focused on the data analysis, we end up with many data inputs all derived from the same original set, but filtered one way or another, and we go back and forth between them. Two months later, when the reviews come back, it is not so easy to figure out which data was plotted where.

written 11 months ago by Istvan Albert

Sure, you need to document how each version of each figure is made. Ideally, you just extend whatever analysis you have by creating a new file (with a new name, or with some version or date associated with it that is then reflected in your documentation/code).

written 11 months ago by Devon Ryan

This happens to be identical to the common practice on the wet-lab side, where freezer/-80 space is always shared and HIGHLY limited.

Great analogy.

written 8 months ago by Gjain
fanli.gcb (Los Angeles, CA) wrote, 11 months ago:

In addition to Devon's answers above:

  1. Sanity check. If you filter a dataset, check that this actually happened! Especially in projects glued together with scripting, it is easy for errors and omissions to go unnoticed.
  2. Don't reinvent the wheel. Chances are whatever you want to do has been done before. Biostars, Stack Overflow, and SEQanswers are all great places to search first.
  3. Take a step back and look at the big picture. What hypotheses are we trying to (dis)prove? What conclusions can we draw from the data, and what potential impact could they have?
  4. In line with #3, it's really important to keep up with the scientific and technical literature. Bleeding-edge approaches are great, and having a feel for where to apply them is just as important.
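The sanity check in point 1 can be as cheap as counting records before and after a filter, so a silent no-op is obvious. A minimal sketch; the file name, stand-in data, and "column 2 is a score" layout are all hypothetical:

```shell
# count records before and after filtering so a silent no-op is obvious
# (file name and the "column 2 = score" layout are hypothetical)
printf 'a\t1\nb\t40\nc\t55\n' > hits.tsv      # stand-in data
before=$(wc -l < hits.tsv)
awk -F'\t' '$2 >= 30' hits.tsv > hits.filtered.tsv
after=$(wc -l < hits.filtered.tsv)
echo "kept $after of $before records"
```

If `after` equals `before`, the filter did nothing and you want to know now, not after the downstream analysis.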
written 11 months ago by fanli.gcb
John wrote, 11 months ago:

Things not to do:

  1. Make Quality Control plots but never really look at them.

  2. Make Quality Control plots and look at them for too long.

modified 11 months ago • written 11 months ago by John

Or maybe keep looking at them until you have the published version of them!

modified 11 months ago • written 11 months ago by MAPK

John, don't tell us it is a joke. I love this joke.

written 11 months ago by Shicheng Guo

I think eyeballing QC plots is overrated and a thing of the past. I just got a dataset with over 150 samples; due to technical replicates, it comes as over 600 fastq files, each with a FastQC-generated pdf (thankfully). Should I open all 600 pdfs lying around on the server, copying them to my computer and double-clicking each one, just to find that the read quality is maybe OK for most, and that all of the samples fail the per-position sequence composition filter, like all samples before? Even at only 30 seconds per file, that is 5 hours straight. It might be worth spending at most 10 seconds per file, which I could do by putting all the pdfs in one folder and using the macOS gallery view to look at just the first page of each pdf, opening a file only when I spot a problem. A tabular overview would be much better.

modified 11 months ago • written 11 months ago by Michael Dondrup

This is a perfect application for MultiQC by Phil Ewels.

written 11 months ago by genomax

AfterQC is another great QC tool for fastq.

modified 11 months ago • written 11 months ago by biomaster

I think watching QC plots is overrated and a thing of the past.

Hi - I think this is a bit harsh. Rather, I would say QC tools should produce output in a form that is easy to tabulate, so that looking at hundreds of QC reports is not difficult. In your case the problem was that PDFs are pretty much the opposite of "easy to tabulate", but if you had the raw output of FastQC you could fairly easily parse the text file containing the QC metrics.
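For instance, when FastQC is run with `--extract`, each result folder contains a tab-separated `summary.txt` (status, module, filename) that standard tools can tabulate. A sketch; the folder layout below is faked as a stand-in for real FastQC output:

```shell
# fake two FastQC output folders so the one-liner below has input
# (real ones come from running `fastqc --extract` on each fastq file)
mkdir -p qc/s1_fastqc qc/s2_fastqc
printf 'PASS\tBasic Statistics\ts1.fastq\nFAIL\tPer base sequence content\ts1.fastq\n' > qc/s1_fastqc/summary.txt
printf 'FAIL\tPer base sequence content\ts2.fastq\n' > qc/s2_fastqc/summary.txt

# one line per failing module, counted across all samples
cat qc/*_fastqc/summary.txt | awk -F'\t' '$1 == "FAIL" {print $2}' | sort | uniq -c
```

Six hundred samples collapse into a handful of lines: each failing module with the number of samples that failed it.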

modified 11 months ago • written 11 months ago by dariober
Khader Shameer (Manhattan, NY) wrote, 11 months ago:
  1. Plan and manage the projects as modules: for example data clean up/QC, database management, analytics, predictive modeling, machine learning, statistical inference, data visualization and biological/clinical inference. This would help in the long-run for plug-and-play and easily build, test and deploy analytic pipelines.
  2. Assess the task: think before spending countless hours coding that slick function. Someone may have already made an open bio-* package for the bioinformatics task.
  3. Backup: document, version control, and back up everything (including the Linux/Unix command line, using history). Bioinformatics is an applied practice and often contributes to scientific inference and clinical impact; here, reproducibility is incredibly important, and tracking data provenance can help with it.
  4. Collaborate: co-create, and code-review
  5. Design thinking: Spend time to solve the task creatively, you have the choice to convert the bioinformatics task to a simple script or a package that many others could use.
  6. Engineer, don't just code. Understanding the technical details and knowing how to scale a system from one data set to 100 or 1000 data sets is key.
  7. Future-proof the infrastructure: code can crack and pipelines can break, so it is good to have a mechanism to maintain and support the bioinformatics infrastructure.
  8. Give back to the community - share code, analytics or blog. This would give more visibility and help to take the tool/paper to a large user base.
  9. Happy bioinformatics: Enjoy.
modified 11 months ago • written 11 months ago by Khader Shameer
TriS (United States, Buffalo) wrote, 11 months ago:
  1. when you are coding, add comments so that when you go back to it 3 months from now you remember/understand what/why you did it
  2. I use Google Slides to summarize my analysis, add thoughts and plots so that it's well organized and I can access them quickly, comment, go back to bed :)
  3. when using R, save the workspace so that you don't have to re-run the whole code when you need to go back
  4. when possible, use more than one approach to analyze the data; if the results are consistent, great, and if not, work out why
  5. can I mention again backups on server(s)?
written 11 months ago by TriS

I think your No. 3 is great. I will use it later.

written 11 months ago by Shicheng Guo

I always have an .RData and an .Rhistory file saved, with the workspace being the project directory. The directory itself is part of a project hierarchy, so that structures everything.

Initialize R projects with setwd(), close R sessions with save.image() and savehistory(), reopen them with load() and loadhistory() - that's my routine for every project.

modified 11 months ago • written 11 months ago by Ram

Oh! On point #3 - I didn't know this could be that handy. I always select "No" on any exit prompts. :-/ Thank you very much, I will try using it.

And #2 is definitely helpful. I create flow diagrams in PowerPoint to explain pipelines etc. to my supervisor/PI.

written 11 months ago by Bioinformatics_NewComer

I am going to chime in with a disagreement on point 3. In my mind that feature is like the dark side in Star Wars or, as Yoda would say

Easily they flow, quick to join you when code you write. If once you start down the dark path, forever will it dominate your destiny. Consume you it will.

Basically, reproducible analysis describes our ability to quickly reproduce a result - BUT that needs to happen from the raw data, not from some intermediate state that we don't quite remember how we got to.

Don't get me wrong, .RData and .Rhistory are very useful, but as with many good things, one needs to use them in moderation and understand their pitfalls.

written 11 months ago by Istvan Albert
Sean Davis (National Institutes of Health, Bethesda, MD) wrote, 11 months ago:

Do as much of your work in the public eye as possible. Github and the like have changed the way that I think and work.

written 11 months ago by Sean Davis
Ryan Dale (Bethesda, MD) wrote, 11 months ago:

If using data from other sources, keep track of where it came from. This can be as easy as a shell script with a bunch of wget or curl lines, but such a small thing can make a big difference in a few months when you forget where you got those files.
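A minimal sketch of such a script, which also records the download date and checksum (as Philipp suggests below the fold). The URL is a placeholder, and the fetch falls back to a local stand-in file so the logging still works offline:

```shell
# fetch a file and record where and when it came from
# (the URL is a placeholder; the fallback keeps this runnable offline)
url="https://example.org/annotation.gtf.gz"
out="annotation.gtf.gz"
wget -q -O "$out" "$url" || echo "placeholder" > "$out"

{
  echo "file: $out"
  echo "url:  $url"
  echo "date: $(date -u +%F)"
  echo "md5:  $(md5sum "$out" | cut -d' ' -f1)"
  echo
} >> DOWNLOADS.log
cat DOWNLOADS.log
```

Re-running the script re-downloads everything, and `DOWNLOADS.log` answers "where did this file come from?" months later.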

written 11 months ago by Ryan Dale

I also store the md5sum and other details like "date downloaded" and "# sequences included" for each file - public datasets like uniprot_sprot.fasta keep changing, and it's easier to compare with collaborators if you have md5sums.

written 11 months ago by Philipp Bayer
biomaster (San Jose) wrote, 11 months ago:


Talk less, code more!

written 11 months ago by biomaster

That might work as a punchline, but a good bioinformatician discusses with collaborators, checks for existing tools and "codes" only as a last option.

written 11 months ago by Ram

OK but don't think less!

written 11 months ago by Manu Prestat
chen (Strange Tools) wrote, 11 months ago:

Here are some of my tips:
1, sharing: turn your code into libraries and share them on GitHub
2, visualization: always visualize your data
3, noise: keep in mind that data always comes with noise; filter and clean it before using it
4, git: use git to track all your code, manuscripts, and slides
5, toolchain: maintain the tools you usually use as a toolchain
6, testing: always test your pipelines/algorithms/tools with benchmark data

modified 11 months ago by Sean Davis • written 11 months ago by chen
Vivek wrote, 11 months ago:

Slightly unusual, but tar up and archive all your work directories after you are done with a large project to save money on storage. Tape archives, especially the ones maintained by large genome centers, are quite cheap, and data can be retrieved in a day or two. On the other hand, you get billed heavily for data storage on regular file systems.
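A sketch of the archiving step; directory and file names are hypothetical, and the checksum is verified before the working copy is deleted:

```shell
# archive a finished project with a checksum so the tape copy is verifiable
# (directory and file names are hypothetical)
mkdir -p myproject && echo "results" > myproject/final.txt   # stand-in data
tar czf myproject.tar.gz myproject/
md5sum myproject.tar.gz > myproject.tar.gz.md5

# only remove the working copy once the archive verifies
md5sum -c myproject.tar.gz.md5 && rm -rf myproject/
```

When the tarball comes back from tape, the same `md5sum -c` confirms it survived the round trip intact.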

written 11 months ago by Vivek
Israel Barrantes wrote, 6 months ago:

Keep a command-line history log for every program you installed/compiled, including the version numbers of its dependencies. This will be really helpful not only for reproducibility, but also in case you move to new servers.
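One way to sketch such a log: append a dated entry with the tool's version string after each install. In this example `bash` stands in for whatever tool you just built, and the log path is hypothetical:

```shell
# append a dated entry with the tool's version after each install;
# `bash` here stands in for the tool you just compiled (hypothetical)
log=install_log.txt
{
  echo "== $(date -u +%F) installed: bash =="
  bash --version | head -n 1
} >> "$log"
tail -n 2 "$log"
```

Rebuilding the same environment on a new server then becomes a matter of replaying the log, version by version.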

written 6 months ago by Israel Barrantes
Asaf wrote, 11 months ago:

Make your projects reproducible.

written 11 months ago by Asaf
Powered by Biostar version 2.3.0