This is more a proposal to collaborate on building a new QC tool, or, if you will, a description of a hypothetical tool that may not yet exist.
Background: FastQC is used by many as a QC tool for sequencing data, and it has its merits. However, its evaluation is based on theoretical assumptions about how sequencing data should behave, not on how real data actually behave. Moreover, since FastQC was first implemented, a lot has been learned about real sequencing data, and the technology has advanced massively. As an example, many questions on BioStars concern the per-base composition module in FastQC, which, when applied to RNA-seq data generated with random priming, regularly indicates a QC failure. The traffic-light system used to summarize quality is equally suggestive, simplistic, and misleading, because it does not take into account how other data sets look.
I would like to propose a different approach instead, based on empirical data, similar to the quality rating of protein 3D structures in the PDB. A data set would be analyzed and compared with data from comparable sequencing experiments, and each statistic would then be summarized as its quantile relative to those other data sets. There is ample (possibly too much) data in the SRA that could be used for this.
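The quantile idea could be sketched roughly as follows. This is a minimal sketch; the choice of statistic (duplication rate) and the reference values are made up purely for illustration, not taken from any real survey:

```python
from bisect import bisect_left

def quantile_rating(value, reference_values):
    """Return the empirical quantile (0..1) of `value` within the same
    statistic computed from comparable reference data sets."""
    reference_values = sorted(reference_values)
    rank = bisect_left(reference_values, value)
    return rank / len(reference_values)

# Hypothetical example: duplication rates observed in 8 comparable
# RNA-seq runs (made-up numbers standing in for pre-computed survey data).
reference = [0.12, 0.15, 0.18, 0.22, 0.25, 0.31, 0.40, 0.55]

# A new data set with a duplication rate of 0.28 falls between the
# 5th and 6th reference values, i.e. roughly the 62nd percentile.
print(quantile_rating(0.28, reference))  # 0.625
```

Instead of a binary pass/fail, the user would see where their run sits in the empirical distribution, which is exactly the information the traffic lights throw away.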
Let me know your critique.
- This needs to be a collaborative project, because of the compute requirements.
- The project needs a "survey" phase, in which existing data are analyzed and summary statistics are pre-computed.
- We need to agree on a set of summary statistics that define quality.
- Define an exchange format (e.g. SQLite) for the QC results, so that statistics can be computed on different nodes and exchanged between them.
- Implement a distributed application that users can easily install (like SETI@home), which scans a given range of the SRA accession space and delivers the results back.
- Possibly many of these statistics are already available in the SRA.
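As a starting point for discussion, the SQLite exchange format could be as simple as one table of (accession, statistic, value) rows. The schema, table name, and example accession below are made-up placeholders, not a proposed standard:

```python
import sqlite3

# A tiny exchange database; use a file path instead of :memory: when
# actually shipping results between nodes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE qc_stats (
        accession    TEXT NOT NULL,  -- e.g. an SRA run accession
        stat_name    TEXT NOT NULL,  -- e.g. 'gc_content', 'dup_rate'
        stat_value   REAL NOT NULL,
        tool_version TEXT,           -- provenance of the computation
        PRIMARY KEY (accession, stat_name)
    )
""")
conn.execute(
    "INSERT INTO qc_stats VALUES (?, ?, ?, ?)",
    ("SRR000001", "gc_content", 0.47, "qc-survey-0.1"),
)
conn.commit()

# Another node can open the same file and merge the rows into its
# own copy of the survey.
row = conn.execute(
    "SELECT stat_value FROM qc_stats WHERE accession=? AND stat_name=?",
    ("SRR000001", "gc_content"),
).fetchone()
print(row[0])  # 0.47
```

A flat key-value schema like this keeps merging trivial (a plain `INSERT OR IGNORE` across files), while the set of allowed `stat_name` values would be fixed by the summary statistics we agree on.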