Has anybody made a systematic survey of quality control parameters in published NGS data? I am especially interested in the duplication levels people see in ChIP-Seq and RNA-Seq data.

Is anything like this available for the ENCODE data, or does one have to download the data and run the QC oneself?

I think it would be a good idea, and interesting, if you could do it ;) It might not be easy to interpret the results, though. For example, if you find a certain percentage of duplicated reads or low-quality reads, what does that actually mean for the interpretation of the data? I believe one would have to agree on a set of meaningful statistics to report for each dataset. One might be the percentage of reads that map to the reference sequence (correctly paired or not), but then again there are many other parameters to consider.
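To make "a set of meaningful statistics" a bit more concrete, here is a minimal sketch of two such numbers. It assumes one particular definition of each, which is not necessarily what any given paper reports: duplication is counted as exact-sequence copies among the reads, and the mapping percentages are derived from SAM FLAG bits (0x4 unmapped, 0x2 properly paired, 0x400 duplicate), skipping secondary and supplementary records. The function names `duplication_rate` and `flag_summary` are made up for illustration.

```python
from collections import Counter


def duplication_rate(sequences):
    """Fraction of reads that are exact-sequence copies of another read.

    Assumption: a 'duplicate' is any read beyond the first occurrence of
    its sequence. Position-based duplicate marking on aligned reads can
    give different numbers.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


def flag_summary(flags):
    """Percent mapped, properly paired and duplicate-flagged reads.

    `flags` is an iterable of integer SAM FLAG values, one per record.
    Secondary (0x100) and supplementary (0x800) records are skipped so
    that each read is counted once.
    """
    primary = [f for f in flags if not f & 0x900]
    n = len(primary)
    if n == 0:
        return {"mapped": 0.0, "properly_paired": 0.0, "duplicate": 0.0}
    mapped = sum(1 for f in primary if not f & 0x4)
    proper = sum(1 for f in primary if f & 0x2 and not f & 0x4)
    dup = sum(1 for f in primary if f & 0x400)
    return {
        "mapped": 100.0 * mapped / n,
        "properly_paired": 100.0 * proper / n,
        "duplicate": 100.0 * dup / n,
    }


if __name__ == "__main__":
    # Toy input, not real data.
    print(duplication_rate(["ACGT", "ACGT", "TTTT", "ACGT"]))
    print(flag_summary([99, 147, 4, 1123, 256]))
```

Even with numbers like these in hand, the interpretation problem remains: a high duplication level can reflect low library complexity, but it can also arise from very highly expressed transcripts in RNA-Seq, so the same value may mean different things in different assays.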

I agree it would be interesting. I think sourcing the primary data for any given publication might be the stumbling block.

Shouldn't the data be deposited in the SRA? It would still be a large burden to download and analyze everything.

@Michael: I'd be curious about compliance with SRA submission as a related question. I think it is likely to be lower than for microarrays.

This would be a very interesting but challenging study. It seems that in the early days, every NGS paper that was published would receive a lot of probing from reviewers about library depth, coverage, percent duplicates, library complexity, quality distribution, etc. Now it is common to leave out these kinds of metrics, as if the methods were well developed and standardized. In my opinion, this is far from the case. It's not like an Affy chip, where you have a well-defined system with well-developed processing methods and QC protocols. I think that RNA-seq is fundamentally better than arrays as a data type, but the analysis and QC of such data is still very much the Wild West. One group to watch in this area is MAQC-III (also known as SEQC). They are "assessing the technical performance of next-generation sequencing platforms by generating benchmark datasets with reference samples and evaluating advantages and limitations of various bioinformatics strategies in RNA and DNA analyses". They are looking at multiple platforms, performing spike-in experiments, comparing performance to benchmarks (e.g., arrays), etc., and have about 25 analysis teams involved. I think we can probably expect a special issue on the results of their first round some time in late 2012.

