Question: Duplication And Quality Control In Published NGS Data
Asked 8.9 years ago by Ido Tamir (5.1k):

Has anybody made a systematic survey of quality control parameters in published NGS data? I am especially interested in the duplication levels people see in ChIP-Seq and RNA-Seq data.

Is there anything like this available for the ENCODE data, or does one have to download the data and run the QC oneself?

quality qc • 2.5k views

I think it would be a good idea, and interesting, if you could do it ;) The results might not be easy to interpret, though. For example, if you find a certain percentage of duplicated or low-quality reads, what does that actually mean for the interpretation of the data? I believe one would have to agree on a set of meaningful statistics to report for each dataset. One might be the percentage of reads that map to the reference sequence (correctly paired or not), but then again there are many other parameters to consider.

Reply written 8.9 years ago by Michael Dondrup (48k)
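
To make the two statistics named in the comment above concrete, here is a minimal sketch of how percent mapped and percent duplicated reads could be computed. It assumes the alignments have already been parsed into (chromosome, start, strand) tuples, with None for unmapped reads; the function name `qc_summary` and this input format are hypothetical and not taken from any tool mentioned in the thread. It also uses the simplest possible definition of a duplicate (same chromosome, start, and strand), whereas dedicated duplicate-marking tools apply more careful rules, for example for paired-end reads.

```python
from collections import Counter


def qc_summary(alignments):
    """Summarise mapping and duplication for a set of reads.

    alignments: iterable of (chrom, start, strand) tuples,
    or None for reads that did not map (assumed input format).
    """
    total = 0
    positions = Counter()
    for aln in alignments:
        total += 1
        if aln is not None:
            positions[aln] += 1

    mapped = sum(positions.values())
    # Every read beyond the first at a given position and strand
    # is counted as a duplicate.
    duplicates = mapped - len(positions)

    return {
        "total": total,
        "percent_mapped": 100.0 * mapped / total if total else 0.0,
        "percent_duplicate": 100.0 * duplicates / mapped if mapped else 0.0,
    }


# Toy input: four mapped reads (two share a position and strand), one unmapped.
example = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr2", 5, "+"),
    None,
]
summary = qc_summary(example)
print(summary)
```

As the comment points out, the number alone says little: a given duplicate percentage may need a different reading in ChIP-Seq than in RNA-Seq, so any survey would have to report it alongside the assay type and sequencing depth.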

I agree it would be interesting. I think sourcing the primary data for any given publication might be the stumbling block.

Reply written 8.9 years ago by User 59 (13k)

Shouldn't the data be deposited in the SRA? Even so, it would still be a large burden to download and analyze everything.

Reply written 8.9 years ago by Michael Dondrup (48k)

@Michael: I'd be curious about compliance with SRA submission as a related question. I think it is likely to be lower than for microarrays.

Reply written 8.9 years ago by Sean Davis (26k)

Answered 8.9 years ago by Obi Griffith (19k), Washington University, St Louis, USA:

This would be a very interesting but challenging study. It seems that in the early days, every NGS paper that was published would receive a lot of probing from reviewers about library depth, coverage, percent duplicates, library complexity, quality distribution, etc. Now it is common to omit these kinds of metrics, as if the methods were well developed and standardized. In my opinion, this is far from the case. It's not like an Affy chip, where you have a well-defined system with well-developed processing methods and QC protocols. I think that RNA-seq is fundamentally better than arrays as a data type, but the analysis and QC of such data are still very much in the Wild West. One group to watch in this area is MAQC-III (also known as SEQC). They are "assessing the technical performance of next-generation sequencing platforms by generating benchmark datasets with reference samples and evaluating advantages and limitations of various bioinformatics strategies in RNA and DNA analyses". They are looking at multiple platforms, performing spike-in experiments, comparing performance to benchmarks (e.g., arrays), etc., and have about 25 analysis teams involved. I think we can probably expect a special issue on the results of their first round some time in late 2012.
