Statistical Challenges For Analyzing NGS Data?
11.2 years ago
Faisal ▴ 50

I am an applied statistician by profession, with a strong interest in analyzing next-generation sequencing (NGS) data. I would like to get a big-picture view of the challenges this field poses for statisticians. So far I have gone through recent issues of bioinformatics journals, but I am still confused. Could you recommend some review articles or other material?

ngs statistics • 3.0k views

What aspect of NGS are you interested in? First there is the process by which the tags are generated, for instance a ChIP experiment or other possibilities like GRO-seq, MNase-seq, DNase-seq, FAIRE-seq and many more. Because each has its own underlying biology, the data generated can have different properties. Then there is the sequencing process itself, which might be on an Illumina machine for instance. Then there is the process of detecting enrichment or peak-calling. Then there is the process of identifying reproducible results. Then there is the process of figuring out if two sets of genomic features overlap. (I am assuming here that you mean NGS for DNA-seq or RNA-seq experiments and not for other tasks like genome assembly or calling SNPs.) In short, could you narrow it down?

@George is correct. It will take you more than a few days to find a niche problem. You can't shortcut the process toward publication. You need to learn the field.

My comment is not meant as a discouragement. I can think of a few cool statistics-heavy papers that I would be happy to share. However, it depends on what part of the field you are most likely to be working on.

I am mainly interested in DNA-seq and RNA-seq data and in the process of detecting enrichment or peak-calling. Beyond that, my research focuses on approximate inference for network biology/complex systems and on the estimation of missing values.

I'd suggest getting some data and a set of questions to start. You'll learn a lot by just going through the process of analyzing data like those from the ENCODE project. Expect that becoming comfortable with the data and the questions could take a few weeks to months.

I think I would suggest the modENCODE project, just because the genomes are smaller and thus processing is often faster, assuming that the statistics is what he cares about and not the organism per se.

11.2 years ago

I think the commenters on your post are on the right track. You really need to specify the domain of application to even define the right challenges.

But just for the sake of it I will attempt to zoom all the way out and formulate what I think the main challenges are:

  1. Systematic errors - every single step of sample preparation and sequencing carries a bias, and since we make hundreds of millions of measurements, the effects of these biases are always visible and often stronger than the effects we are trying to measure. For example: data collected on Mondays may be more consistent with each other than data from healthy patients.
  2. Multiple comparisons - we usually simultaneously measure all the components of the biological system - many (most) of which may still be unknown
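
To make the multiple comparisons point concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate when many hypotheses are tested at once. The function name `benjamini_hochberg` and the p-values are made up for illustration and do not come from any real dataset. In practice you would probably use an established implementation from a statistics library rather than this hand-rolled version.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    at false discovery rate `alpha` (Benjamini-Hochberg step-up procedure).

    Hypothetical helper written for illustration only.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Illustrative (made-up) p-values for ten features, e.g. ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p < 0.05 for p in p_values)      # uncorrected threshold
bh_hits = sum(benjamini_hochberg(p_values, 0.05))  # FDR-controlled

print("uncorrected:", naive_hits, "after BH:", bh_hits)
```

With these toy numbers the uncorrected threshold flags more features than the corrected one, and with tens of thousands of genes or genomic windows the gap is typically far larger.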