Question

Statistical Challenges For Analyzing NGS Data?

1

11.2 years ago

Faisal ▴ 50

I am by profession an applied statistician. I have a strong interest in analyzing next-generation sequencing (NGS) data. I would like to get a big-picture view of the challenges for statisticians. So far I have gone through recent issues of bioinformatics journals, but I am still confused. Could you recommend some review article(s) or other material?

ngs statistics • 3.0k views

updated 11.2 years ago by Istvan Albert 100k • written 11.2 years ago by Faisal ▴ 50

1

What aspect of NGS are you interested in? First there is the process by which the tags are generated, for instance a ChIP experiment or other possibilities like GRO-seq, MNase-seq, DNase-seq, FAIRE-seq and many more. Because each has its own underlying biology, the data generated can have different properties. Then there is the sequencing process itself, which might be on an Illumina machine for instance. Then there is the process of detecting enrichment, or peak-calling. Then there is the process of identifying reproducible results. Then there is the process of figuring out whether two sets of genomic features overlap. (I am assuming here that you mean NGS for DNA-seq or RNA-seq experiments and not for other tasks like genome assembly or calling SNPs.) In short, could you narrow it down?

11.2 years ago by KCC ★ 4.1k

0

@George is correct. It will take you more than a few days to find a niche problem. You can't shortcut the process toward publication. You need to learn the field.

11.2 years ago by Zev.Kronenberg 12k

0

My comment is not meant as a discouragement. I can think of a few cool statistics-heavy papers that I would be happy to share. However, it depends on what part of the field you are most likely to be working on.

11.2 years ago by KCC ★ 4.1k

0

I am mainly interested in DNA-seq or RNA-seq data types and the process of detecting enrichment or peak-calling. Furthermore, my research focuses on approximate inference for network biology/complex systems and estimation of missing values.

11.2 years ago by Faisal ▴ 50

2

I'd suggest getting some data and a set of questions to start. You'll learn a lot by just going through the process of analyzing data like those from the ENCODE project. Expect that becoming comfortable with the data and questions could take a few weeks to months.

11.2 years ago by Sean Davis 26k

3

I think I would suggest the modENCODE project, just because the genomes are smaller and thus processing is often faster, assuming that the statistics is what he cares about and not the organism per se.

11.2 years ago by KCC ★ 4.1k

score 1 · Answer 1 · 2013-02-08

I think the commenters on your post are on the right track. You really need to specify the domain of application to even define the right challenges.

But just for the sake of it I will attempt to zoom all the way out and formulate what I think the main challenges are:

Systematic errors - every single step of sample preparation and sequencing carries a bias, and since we make hundreds of millions of measurements the effects of these biases are always visible and often stronger than the effects that are to be measured. For example: data collected on Mondays may be more consistent with each other than data on healthy patients.
Multiple comparisons - we usually simultaneously measure all the components of the biological system - many (most) of which may still be unknown
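
To make the multiple comparisons point concrete, here is a minimal sketch in Python. It is not tied to any particular NGS tool; the function name `benjamini_hochberg`, the 0.05 threshold, and the simulated p-values are all assumptions chosen for illustration. It draws 10,000 p-values under the null (no real effect anywhere), counts how many fall below 0.05 by chance, and compares that with the Benjamini-Hochberg step-up procedure, one common way of controlling the false discovery rate.

```python
import random


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the test is rejected at FDR level alpha.

    Illustrative helper (hypothetical name), implementing the standard
    Benjamini-Hochberg step-up rule.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Simulated data (an assumption): 10,000 tests with no true effect,
# so every p-value is uniform on [0, 1].
random.seed(0)
null_p = [random.random() for _ in range(10000)]

naive_hits = sum(p < 0.05 for p in null_p)
bh_hits = sum(benjamini_hochberg(null_p, alpha=0.05))

print("uncorrected 'significant' tests:", naive_hits)
print("significant after Benjamini-Hochberg:", bh_hits)
```

With no correction, roughly 5% of the null tests are expected to look significant, which is hundreds of false hits at this scale. The corrected count should be at or near zero. Real sequencing data adds complications this sketch ignores, such as dependence between tests and discrete count distributions.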