Question

Forum: Getting started with NGS analysis, common issues?

2

9.2 years ago

Keith ▴ 50

Hi All,

I'm starting a company called Stirplate.io which aims to automate data analysis for NGS. In my time as a researcher, I've found that a lot of biologists have a hard time working with NGS data considering they have no Linux background and no time to learn how to use the tools available.

That being said, we are trying to boil down exactly what researchers need in order to be successful with their analysis, regardless of their level of bioinformatics comprehension.

Another way to think about this is... if you could snap your fingers and educate biologists about these 3-5 things before they start doing NGS analysis, what would they be?

Thanks for the help!

Keith

software-error next-gen • 3.2k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Keith ▴ 50

score 7 · Answer 1 · 2015-02-20

7

9.2 years ago

Michael 54k

if you could snap your fingers and educate biologists about these 3-5 things before they start doing NGS analysis, what would they be?

IMHO, NGS data analysis should be done not by biologists but by bioinformaticians, computational biologists, or data analysts with a solid background in statistics, programming, data management and Unix. Or do you let your sysadmin do the RNA extraction? But if it must be 3-5 items:

replication
replication
replication
course in applied statistics & R
10 years of experience with unix, shell, perl, python

ADD COMMENT • link 9.2 years ago by Michael 54k

4

Why 10 years? On second thought, I started Linux in 2006. Damn I'm getting old!

ADD REPLY • link 9.2 years ago by Ram 43k

1

"Damn I'm getting old!"

I second that :)

ADD REPLY • link 9.2 years ago by mxs ▴ 530

0

Well, if I could snap my fingers..., but 5 years could do, or: push one button in Galaxy.

ADD REPLY • link 9.2 years ago by Michael 54k

Ram · Answer 2 · 2015-02-21

Here's my take on this, partially repeating other answers:

Basic statistics: At least enough to understand why replication matters and why a p-value of 0.001 is not too exciting if it comes from a batch of 20000 tests.
R programming: At least enough to follow a Bioconductor vignette and to replace Excel for common tasks like sorting and subsetting a table, or plotting.
Unix skills: Able to move around files and dirs (cd, ls, mkdir, etc.) and use Unix tools to manipulate files (sort, cut, awk, etc.). Able to install and launch a command-line program (provided that there are no quirks in the installation process).
Tools for bioinformatics: For NGS you will almost certainly need samtools and bedtools or equivalent. Understanding of common file formats: SAM, BED, GTF.
Good housekeeping: Document what you have done, just like in a lab book. Avoid duplicating data, copying files and keeping stuff "just in case..." (some concepts from the relational database world would be handy).
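
To make the p-value point concrete: if all 20000 tests were true nulls, a cutoff of 0.001 would still let through about 20 hits by chance alone. Below is a minimal Python sketch of that arithmetic, plus a Benjamini-Hochberg adjustment as one common way to correct for it. The function names are made up for illustration, and in R the built-in `p.adjust(p, method = "BH")` does the same job.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests with p <= alpha if every null hypothesis is true."""
    return n_tests * alpha


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# The case from the answer: 20000 tests, threshold 0.001.
print(expected_false_positives(20000, 0.001))

# Toy p-values (illustrative only, not real data).
print(benjamini_hochberg([0.001, 0.5, 0.9, 0.04]))
```

The adjusted values are what you would compare against your chosen false discovery rate, rather than the raw p-values.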

Some scripting language like python would be a plus but with R+Unix tools you already go a long way.

I wouldn't require all the stuff above, but at least I would ask for a willingness to learn.

In general, you would need enough understanding and familiarity to be able to present your problem to "the expert" (say the pure statistician/mathematician/computer scientist) and be able to make use of the answer. Essentially be able to communicate with non-biologists.

I'd like to disagree with Michael Dondrup when he says biologists shouldn't do (NGS) analysis. I think the biologist is the one who best understands the problem, so he/she is in the best position to ask meaningful questions and spot features in the data attributable to potentially interesting biological or technical causes. Then of course, you still need a "proper" bioinformatician in the team, but I think you can get a lot done with basic expertise from the list above.

(By the way, I don't claim to be an expert. In fact the distance between what I know and what I should know seems to keep increasing with time. Sigh...)

Ram · Answer 3 · 2015-02-20

3

9.2 years ago

Pierre Lindenbaum 161k

Another way to think about this is... if you could snap your fingers and educate biologists about these 3-5 things before they start doing NGS analysis, what would they be?

Learn the SAM format: what is in the header? what are the columns? what are the flags? what is soft/hard clipping? etc ...
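
As a rough illustration of those questions, here is a small Python sketch that splits an alignment line into the 11 mandatory SAM columns, decodes the FLAG bits, and counts soft- and hard-clipped bases from the CIGAR string. The column names and flag bits follow the SAM specification; the helper names and the example read are invented for this sketch, and in practice you would more likely use samtools or a library such as pysam.

```python
import re

# The 11 mandatory columns of a SAM alignment line, in order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def clipped_bases(cigar: str) -> dict[str, int]:
    """Count soft-clipped (S) and hard-clipped (H) bases in a CIGAR string."""
    counts = {"S": 0, "H": 0}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in counts:
            counts[op] += int(length)
    return counts


def parse_alignment(line: str) -> dict:
    """Split one alignment line into named mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    record["TAGS"] = fields[11:]
    return record


# Hypothetical read, made up for illustration.
example = "read1\t99\tchr1\t100\t60\t5S45M\t=\t300\t250\t" + "A" * 50 + "\t" + "I" * 50 + "\tNM:i:0"
rec = parse_alignment(example)
print(decode_flag(int(rec["FLAG"])))
print(clipped_bases(rec["CIGAR"]))
```

Soft-clipped bases are still present in SEQ, while hard-clipped bases are not, which is why the two are counted separately here.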

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Pierre Lindenbaum 161k

Ram · Answer 4 · 2015-02-20

Something which always comes up when I speak with new (biology) graduate students is the need to explain the limitations of, and potential sources of false positives in, bioinformatic pipelines. Our in-house DNAseq pipeline runs basically with the push of a button, but it takes some training to interpret the results, especially since the default parameters are tuned for high sensitivity.

My list for DNAseq would be:

Sequencing-by-synthesis basics (homopolymers).
Mapping basics (segmental duplications, repetitive regions, other insertions)
Variant calling basics (repeats, variation missed by variant callers, read position rank sum, etc.)

Other types of analysis:

A power calculation
The definitions of technical variability, biological variability and treatment effect.

Ram · Answer 5 · 2015-02-22

1

9.2 years ago

daiefa123 ▴ 150

If you want a fast introduction to NGS data analysis, I would recommend one of the many workshops available:

If you search within BioStars, or in google for 'NGS workshop' or 'NGS course', you will find many many more. For me it was worth the money.

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by daiefa123 ▴ 150

0

Mmmhh... Yes and no, I would say. Yes because these courses are great, well taught etc. But no because you still have to get your hands dirty with your own data, where things don't fit together as nicely as in the textbook. I see many people taking R/Python/Unix/NGS courses but then not actually putting that knowledge to work. But overall, yes, it's good advice.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by dariober 14k

0

I have to agree with dariober. The situation is pretty often just as he described it. Dealing with your own data is something completely different, and in the courses (also our courses) dummy data are used. I think these courses are very good for getting a first knowledge base (file formats, list of good tools, basic idea of pipelines, etc.). They will help you start your project, but to finish it, you have to think and/or read a lot. But that makes absolute sense, no? How should anyone be able to teach the expertise of several years (10 in my case) in one week? That's not possible.

To summarize, courses will help you get basic knowledge in a concentrated and organized way. You will be able to start your work directly, probably skipping the first two months of reading books. But they will not (why should they) do your research work.

! Advertisement !

For the latter, we offer personalized courses where you can use your own data and compose your own workshop agenda, covering the topics you need: http://www.ecseq.com/workshops/personalized.html

! Advertisement !

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by David Langenberger 11k

Ram · Answer 6 · 2015-02-22

I think that the technological/technical gap of programming, Linux, etc. can be dealt with. The more important thing the biologist should be able to do is to translate the raw data into insights. I have helped more than one friend who did some experiment and was overwhelmed by the amount of data, with no idea what to do with it.

My short list is:

How to manage the data. Which format to use? where? how to integrate data from different experiments?
Deep understanding of the statistics in order to design the experiments correctly. For instance, is it better to run 10 replicates with 2M reads each or 5 replicates with 4M reads? What controls should I use?
To understand the trade-off between the amount of data and its simplicity. When we shrink the data we make it more accessible, but we lose some of it. For instance, counting the number of reads per gene in an RNA-seq experiment is a great way to see if they went up or down, but you can't tell if a lot of reads came from the 5' end.
Know the limitations and biases of the experiment and the analysis.
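
The replicates-versus-depth question can be explored with a back-of-the-envelope calculation. The sketch below assumes (this is an assumption, not something stated in the thread) that read counts for a gene follow a negative binomial model with mean mu and variance mu + dispersion * mu^2, so the squared coefficient of variation of one replicate is 1/mu + dispersion, and averaging n independent replicates divides that by n. The gene fraction and dispersion values are hypothetical, and the function name is made up for illustration.

```python
def squared_cv_of_mean(n_replicates: int, reads_per_replicate: float,
                       gene_fraction: float, dispersion: float) -> float:
    """Squared CV of the mean expression estimate across replicates.

    Assumes a negative binomial count model:
    variance = mean + dispersion * mean**2.
    """
    expected_count = reads_per_replicate * gene_fraction
    # 1/mean is the sampling (Poisson) part, dispersion is the biological part.
    per_replicate_cv2 = 1.0 / expected_count + dispersion
    return per_replicate_cv2 / n_replicates


# Hypothetical gene taking up 1 in 100000 reads, with dispersion 0.1.
gene_fraction = 1e-5
dispersion = 0.1

many_shallow = squared_cv_of_mean(10, 2_000_000, gene_fraction, dispersion)
few_deep = squared_cv_of_mean(5, 4_000_000, gene_fraction, dispersion)
print(many_shallow, few_deep)

# With no biological variability the two designs come out the same,
# because the total number of reads is identical.
print(squared_cv_of_mean(10, 2_000_000, gene_fraction, 0.0),
      squared_cv_of_mean(5, 4_000_000, gene_fraction, 0.0))
```

Under these assumptions the design with more replicates gives the less noisy estimate whenever the dispersion is above zero, since extra depth only shrinks the sampling term. The sketch ignores library preparation cost, very lowly expressed genes and everything else a real power calculation would include.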