Question: Statistical Challenges For Analyzing NGS Data?
6.4 years ago by
Faisal50
Faisal50 wrote:

I am an applied statistician by profession, with a strong interest in analyzing next-generation sequencing (NGS) data. I would like to get a big-picture view of the challenges for statisticians. So far I have gone through recent issues of bioinformatics journals, but I am still confused. Could you recommend some review article(s) or other material?

ngs statistics • 2.2k views
modified 6.4 years ago by Istvan Albert ♦♦ 80k • written 6.4 years ago by Faisal 50
What aspect of NGS are you interested in? First, there is the process by which the tags are generated: for instance a ChIP experiment, or other possibilities like GRO-seq, MNase-seq, DNase-seq, FAIRE-seq and many more. Because each has its own underlying biology, the data generated can have different properties. Then there is the sequencing process itself, which might be on an Illumina machine, for instance. Then there is the process of detecting enrichment, or peak-calling. Then there is the process of identifying reproducible results. Then there is the process of figuring out whether two sets of genomic features overlap. (I am assuming here that you mean NGS for DNA-seq or RNA-seq experiments and not for other tasks like genome assembly or calling SNPs.) In short, could you narrow it down?

modified 6.4 years ago • written 6.4 years ago by KCC 3.9k

@George is correct. It will take you more than a few days to find a niche problem. You can't shortcut the process toward publication. You need to learn the field.

modified 6.4 years ago • written 6.4 years ago by Zev.Kronenberg 11k

My comment is not meant as a discouragement. I can think of a few cool statistics-heavy papers that I would be happy to share. However, it depends on what part of the field you are most likely to be working on.

written 6.4 years ago by KCC 3.9k

I am mainly interested in DNA-seq or RNA-seq data types and the process of detecting enrichment or peak-calling. Furthermore, my research focuses on approximate inference for network biology/complex systems and the estimation of missing values.

written 6.4 years ago by Faisal 50
I'd suggest getting some data and a set of questions to start. You'll learn a lot by just going through the process of analyzing data like those from the ENCODE project. Expect that becoming comfortable with the data and the questions could take a few weeks to months.

written 6.4 years ago by Sean Davis 25k
I think I would suggest the modENCODE project, just because the genomes are smaller and thus processing is often faster. That assumes the statistics are what he cares about and not the organism per se.

modified 6.4 years ago • written 6.4 years ago by KCC 3.9k
6.4 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

I think the commenters on your post are on the right track. You really need to specify the domain of application to even define the right challenges.

But just for the sake of it I will attempt to zoom all the way out and formulate what I think the main challenges are:

  1. Systematic errors - every single step of sample preparation and sequencing carries a bias, and since we make hundreds of millions of measurements, the effects of these biases are always visible and often stronger than the effects that are to be measured. For example: data collected on Mondays may be more consistent with each other than data on healthy patients.
  2. Multiple comparisons - we usually simultaneously measure all the components of the biological system - many (most) of which may still be unknown
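
To give a feel for the multiple comparisons point, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the number of tests (20,000), the 0.05 cutoff, and the function name `count_false_positives` are all made up for this example, and every test is simulated as a true null, so each p-value is just a uniform random number. The Bonferroni correction shown is one common, conservative choice, not the only one.

```python
import random


def count_false_positives(n_tests, alpha, seed=0):
    """Simulate n_tests independent tests where the null is true for all.

    Under a true null a p-value is uniform on [0, 1], so every "hit"
    counted here is a false positive by construction.
    Returns (hits at the raw cutoff, hits at the Bonferroni cutoff).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_values = [rng.random() for _ in range(n_tests)]

    # Naive approach: call anything with p < alpha significant.
    naive_hits = sum(p < alpha for p in p_values)

    # Bonferroni: divide the cutoff by the number of tests performed.
    bonferroni_hits = sum(p < alpha / n_tests for p in p_values)

    return naive_hits, bonferroni_hits


if __name__ == "__main__":
    # 20,000 is a hypothetical feature count chosen for illustration.
    naive, bonf = count_false_positives(n_tests=20000, alpha=0.05)
    print("raw cutoff hits:", naive)
    print("Bonferroni cutoff hits:", bonf)
```

With nothing real to find, the raw cutoff should still flag roughly 5% of the tests (on the order of a thousand here), while the corrected cutoff flags almost none. The price is lost power when real effects do exist, which is part of why this remains a challenge.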
modified 6.4 years ago • written 6.4 years ago by Istvan Albert ♦♦ 80k