Forum: Getting started with NGS analysis, common issues?
5.0 years ago by
United States
Keith50 wrote:

Hi All,

I’m starting a company called which aims to automate data analysis for NGS. In my time as a researcher, I’ve found that a lot of biologists have a hard time working with NGS data because they have no Linux background and no time to learn how to use the tools available.


That being said, we are trying to boil down exactly what researchers need in order to be successful with their analysis, regardless of their level of bioinformatics comprehension. 


Another way to think about this is… if you could snap your fingers and educate biologists about these 3-5 things before they start doing NGS analysis, what would they be?

Thanks for the help!


5.0 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

if you could snap your fingers and educate biologists about these 3-5 things before they start doing NGS analysis, what would they be?

IMHO, NGS data analysis should not be done by biologists, but by bioinformaticians, computational biologists, or data analysts with a solid background in statistics, programming, data management and Unix. Or do you let your sysadmin do the RNA extraction? But if it has to be 3-5 items:

  1. replication
  2. replication
  3. replication
  4. course in applied statistics & R
  5. 10 years of experience with unix, shell, perl, python

Why 10 years? On second thought, I started Linux in 2006. Damn I'm getting old!

written 5.0 years ago by RamRS25k

"Damn I'm getting old!"

I second that :)

modified 5.0 years ago • written 5.0 years ago by mxs530

Well, if I could snap my fingers..., but 5 years could do, or: push one button in Galaxy.

written 5.0 years ago by Michael Dondrup47k
5.0 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

Here's my take on this, partially repeating other answers:

  • Basic statistics: At least enough to understand why replication matters and why a p-value of 0.001 is not too exciting if it comes from a batch of 20000 tests.
  • R programming: At least enough to follow a Bioconductor vignette and to replace Excel for common tasks like sorting and subsetting a table, or plotting.
  • Unix skills: Able to move around files and dirs (cd, ls, mkdir, etc.) and use Unix tools to manipulate files (sort, cut, awk, etc.). Able to install and launch a command line program (provided that there are no quirks in the installation process).
  • Tools for bioinformatics: For NGS you will almost certainly need samtools and bedtools or equivalent. Understanding of common file formats: SAM, BED, GTF.
  • Good housekeeping: Document what you have done, just like in a lab book. Avoid duplicating data, copying files and keeping stuff "just in case..." (some concepts from the relational database world would be handy).
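
To make the p-value point concrete, here is a minimal sketch in plain Python (no real data). If all 20000 null hypotheses were true, about 20 tests would still reach p <= 0.001 by chance alone. The function names are made up for this example, and the Benjamini-Hochberg routine is a textbook version written for illustration; for real work one would normally rely on an established implementation such as R's p.adjust.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests with p <= alpha if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# 20000 tests at p <= 0.001: roughly 20 hits expected by chance alone.
print(expected_false_positives(20000, 0.001))
# A single test needs a far smaller p-value to survive Bonferroni correction.
print(bonferroni_threshold(20000))
```

Bonferroni is the most conservative choice; the false discovery rate approach is usually what differential expression tools report as an adjusted p-value.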

Some scripting language like python would be a plus but with R+Unix tools you already go a long way.

I wouldn't require all the stuff above, but at least I would ask for a willingness to learn.

In general, you would need enough understanding and familiarity to be able to present your problem to "the expert" (say the pure statistician/mathematician/computer scientist) and be able to make use of the answer. Essentially be able to communicate with non-biologists.

I'd like to disagree with Michael Dondrup when he says biologists shouldn't do (NGS) analysis. I think the biologist is the one who best understands the problem, so he/she is in the best position to ask meaningful questions and spot features in the data attributable to potentially interesting biological or technical causes. Then of course, you still need a "proper" bioinformatician in the team, but I think you get a lot done with basic expertise from the list above.

(By the way, I don't claim to be an expert. In fact the distance between what I know and what I should know seems to keep increasing with time. Sigh...)

5.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum126k wrote:

"Another way to think about this is… if you could snap your fingers and educate biologists about these 3-5 things before they start doing NGS analysis, what would they be?"


Learn the SAM format: What is in the header? What are the columns? What are the flags? What is soft/hard clipping? Etc.
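
As a rough illustration of those questions, here is a small Python sketch that splits one alignment line into the 11 mandatory SAM columns, decodes the FLAG bits, and counts soft- and hard-clipped bases from the CIGAR string. Header lines are the ones starting with '@' and are not handled here. The example read is invented, and the helper names (parse_sam_line, decode_flag, clipped_bases) are hypothetical; for real files a library such as pysam or samtools itself is the better choice.

```python
import re

# The 11 mandatory columns of a SAM alignment line, in order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Meaning of each FLAG bit as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def clipped_bases(cigar):
    """Return (soft_clipped, hard_clipped) base counts from a CIGAR string."""
    soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "S":    # soft clip: bases are still present in SEQ
            soft += int(length)
        elif op == "H":  # hard clip: bases were removed from SEQ
            hard += int(length)
    return soft, hard


def parse_sam_line(line):
    """Split one alignment line into a dict; optional tags go under 'TAGS'."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ"):
        record[key] = int(record[key])
    record["TAGS"] = fields[11:]
    return record


# Invented example read: 5 soft-clipped bases followed by 45 aligned bases.
example = "\t".join(["read1", "99", "chr1", "100", "60", "5S45M",
                     "=", "300", "250", "A" * 50, "I" * 50])
record = parse_sam_line(example)
print(decode_flag(record["FLAG"]))
print(clipped_bases(record["CIGAR"]))
```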





5.0 years ago by
Mountain View, CA
donfreed1.5k wrote:

Something that always comes up when I speak with new (biology) graduate students is the need to explain the limitations of, and potential sources of false positives in, bioinformatic pipelines. Our in-house DNAseq pipeline runs basically with the push of a button, but it takes some training to interpret the results, especially since the default parameters are tuned for high sensitivity.

My list for DNAseq would be:

1. Sequencing-by-synthesis basics (homopolymers).
2. Mapping basics (segmental duplications, repetitive regions, other insertions)
3. Variant calling basics (repeats, variation missed by variant callers, read position rank sum, etc.)

Other types of analysis:

1. A power calculation
2. The definitions of technical variability, biological variability and treatment effect.
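
For the power calculation, a very rough sketch in Python might look like the following. It uses the standard normal approximation for comparing two group means and assumes that biological and technical variability are independent, so their variances add. All numbers and function names are made up for illustration; count data such as RNA-seq are usually modelled with a negative binomial distribution, so treat this only as a way to see how effect size, variability and replicate number interact.

```python
import math
from statistics import NormalDist


def combined_sd(biological_sd, technical_sd):
    """Total standard deviation, assuming the two sources are independent."""
    return math.hypot(biological_sd, technical_sd)


def samples_per_group(effect, sd, alpha=0.05, power=0.8):
    """Replicates needed per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) * sd / effect)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


# Hypothetical values on some normalised scale.
sd = combined_sd(biological_sd=0.8, technical_sd=0.6)
print(sd)
print(samples_per_group(effect=1.0, sd=sd))
```

Halving the treatment effect roughly quadruples the number of replicates needed, which is why the three definitions in the list matter before any sequencing is ordered.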

5.0 years ago by
daiefa123150 wrote:

If you want a fast introduction to NGS data analysis, I would recommend one of the different workshops available:

If you search within BioStars, or on Google, for 'NGS workshop' or 'NGS course', you will find many more. For me it was worth the money.


Mmmhh... Yes and no, I would say. Yes, because these courses are great, well taught, etc. But no, because you still have to get your hands dirty with your own data, where things don't fit together as nicely as in the textbook. I see many people taking R/python/Unix/NGS courses but then not actually putting that knowledge to work. But overall, yes, it's good advice.

written 5.0 years ago by dariober11k

I have to agree with dariober. The situation is pretty often as he described it. Dealing with your own data is something completely different, and in the courses (also our courses) dummy data are used. I think these courses are very good for getting a first knowledge base (file formats, list of good tools, basic idea of pipelines, etc.). They will help you start your project, but to finish it, you have to think and/or read a lot. But that makes absolute sense, no? How should anyone be able to teach the expertise of several years (10 in my case) in one week? That's not possible.

To summarize, courses will help you get basic knowledge in a concentrated and organized way. You will be able to start your work directly, probably skipping the first two months of reading books. But they will not (why should they) do your research work.

! Advertisement ! For the latter, we offer personalized courses where you can use your own data and compose your own workshop agenda, covering the topics you need: ! Advertisement ! 

written 5.0 years ago by David Langenberger9.2k
5.0 years ago by
Asaf7.0k wrote:

I think that the technological/technical gap of programming, Linux, etc. can be dealt with. The more important thing the biologist should be able to do is to translate the raw data into insights. I have helped more than one friend who ran an experiment and was overwhelmed by the amount of data, with no idea what to do with it.

My short list is:

1. How to manage the data. Which format to use? Where? How to integrate data from different experiments?

2. Deep understanding of the statistics in order to design the experiments correctly. For instance, is it better to run 10 replicates with 2M reads each or 5 replicates with 4M reads? What controls should I use?

3. To understand the trade-off between the amount of data and its simplicity. When we shrink the data we make it more accessible but we lose some of it. For instance, counting the number of reads per gene in an RNA-seq experiment is a great way to see if they went up or down, but you can't tell if a lot of reads came from the 5' end.

4. Know the limitations and biases of the experiment and the analysis. 
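
A toy Python sketch of point 3, using invented reads rather than real data: both samples give the same per-gene count, yet in one of them most reads pile up near the 5' end, which the count table cannot show. The data layout (gene name plus read start as a fraction of transcript length) and the function names are assumptions made for this example only.

```python
from collections import Counter

# Invented reads: (gene, read start as a fraction of transcript length,
# measured from the 5' end).
sample_a = [("geneX", 0.05), ("geneX", 0.10), ("geneX", 0.08), ("geneX", 0.90)]
sample_b = [("geneX", 0.20), ("geneX", 0.50), ("geneX", 0.70), ("geneX", 0.95)]


def count_per_gene(reads):
    """Collapse reads to one number per gene (position is discarded)."""
    return Counter(gene for gene, _ in reads)


def five_prime_fraction(reads, gene, cutoff=0.25):
    """Fraction of a gene's reads starting within the first `cutoff` of it."""
    positions = [pos for g, pos in reads if g == gene]
    if not positions:
        return 0.0
    return sum(pos < cutoff for pos in positions) / len(positions)


# Identical counts...
print(count_per_gene(sample_a)["geneX"], count_per_gene(sample_b)["geneX"])
# ...but very different coverage profiles.
print(five_prime_fraction(sample_a, "geneX"), five_prime_fraction(sample_b, "geneX"))
```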
