Tool: Collaboration on an empirical QC tool
4.5 years ago

This is more a proposal to collaborate on making a new QC tool or, if you will, a description of a hypothetical tool that may not exist yet.

Background: FastQC is used by many as a QC tool for sequencing data, and it has its merits. However, its evaluation is based on theoretical assumptions about how sequencing data should behave, not on how they behave in reality. Also, since FastQC was first implemented, a lot has been learned about real sequencing data, and the technology has advanced massively. As an example, many questions on BioStars concern the base composition analysis in FastQC, which regularly indicates a QC failure when applied to RNA-seq data from random priming. The traffic light system used to summarize quality is suggestive, simplistic and misleading all at once, because it does not take into account how other data sets look.

I would like to propose a different approach instead, based on empirical data, that is similar to the quality rating of protein 3D structure in PDB. Data would be analyzed and compared with data from comparable sequencing experiments, and then summarized by quantile of the statistics in comparison to other data sets. There is ample (possibly too much) data in SRA that could be used.
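To make the quantile idea concrete, here is a minimal sketch in Python. It assumes the reference distribution for one statistic has already been collected from comparable experiments; the function name and the example numbers are made up for illustration and are not taken from any existing tool.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(value, reference):
    """Percent of reference values below `value`, counting ties as half."""
    if not reference:
        raise ValueError("reference distribution is empty")
    ordered = sorted(reference)
    below = bisect_left(ordered, value)           # strictly smaller values
    ties = bisect_right(ordered, value) - below   # values equal to `value`
    return 100.0 * (below + 0.5 * ties) / len(ordered)

# Hypothetical mean Phred scores from comparable runs (illustrative only).
reference_mean_quality = [28.1, 30.4, 31.0, 33.2, 34.8, 35.5, 36.0, 36.9]
rank = percentile_rank(33.5, reference_mean_quality)
```

A report could then say "mean quality is at the Nth percentile of comparable runs" instead of showing a red or green light.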

Let me know your critique.

Brainstorming

  • This needs to be a collaborative project, because of the compute requirements
  • The project needs to contain a "survey" phase, where data is analyzed and pre-computed stats are made.
  • Need to agree on a set of summary stats to define quality
  • Define an exchange format (e.g. in SQLite) for QC so that the data can be computed on different nodes and exchanged between them.
  • Implement a distributed application that users can easily install (like SETI@home), that scans a certain range of the SRA address space and delivers the output back.
  • Possibly a lot of stats are already available in SRA
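
For the exchange format, a rough sketch of what a SQLite-based store could look like, using only Python's standard `sqlite3` module. The table name, the columns and the long "one row per run and stat" layout are assumptions for discussion, not an agreed format.

```python
import sqlite3

# Hypothetical schema: one row per (run, statistic).
SCHEMA = """
CREATE TABLE IF NOT EXISTS qc_stat (
    run_accession TEXT NOT NULL,
    strategy      TEXT NOT NULL,
    stat_name     TEXT NOT NULL,
    value         REAL NOT NULL,
    PRIMARY KEY (run_accession, stat_name)
)
"""
INSERT = "INSERT OR REPLACE INTO qc_stat VALUES (?, ?, ?, ?)"

def open_store(path=":memory:"):
    """Open (or create) a stats database at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def add_stats(conn, run_accession, strategy, stats):
    """Store a dict of {stat_name: value} for one run."""
    rows = [(run_accession, strategy, name, float(v)) for name, v in stats.items()]
    with conn:  # commit on success, roll back on error
        conn.executemany(INSERT, rows)

def merge(target, source):
    """Copy all rows from a node's database into a central one."""
    rows = source.execute(
        "SELECT run_accession, strategy, stat_name, value FROM qc_stat"
    ).fetchall()
    with target:
        target.executemany(INSERT, rows)
    return len(rows)
```

Each node would fill its own file and ship it back, and the collector would call `merge` once per received file.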
RNA-Seq sequencing QC Tool • 1.1k views

Great idea! I've started hiding the FastQC results in a nicer multiQC output so end users stop seeing the traffic light and don't then worry about it. This would be a nice next step.


Good idea. While it awaits implementation, it would be useful to modify FastQC such that an appropriate "limits" file can be selected depending on type of sequencing at hand so the display of "warnings/failures" could be modified. That would minimize accidents due to the traffic lights that appear to trip many new users.
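
As far as I know, the FastQC command line already accepts a `--limits` option pointing to a non-default limits file, so a thin wrapper may be enough for this. A minimal sketch that only builds the command; the mapping from library strategy to limits file and the file paths are hypothetical.

```python
# Hypothetical mapping from SRA library strategy to a custom limits file.
LIMITS_BY_STRATEGY = {
    "RNA-Seq": "limits/rnaseq.txt",
    "WGS": "limits/wgs.txt",
}

def fastqc_command(fastq_path, strategy):
    """Build a FastQC command line, adding --limits when a profile is known."""
    cmd = ["fastqc"]
    limits = LIMITS_BY_STRATEGY.get(strategy)
    if limits is not None:
        cmd += ["--limits", limits]
    cmd.append(fastq_path)
    return cmd
```

The returned list can be passed to `subprocess.run` as is; unknown strategies fall back to FastQC's defaults.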


Great. Possibly we can also use modified FastQC code for the final summary stats, but it needs to be "real fast" to summarize thousands of files per organism and to work on SRA files directly, ideally streaming them.
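
To illustrate what "streaming" could mean here, a single-pass sketch that reads four-line FASTQ records from any text stream and keeps only running totals. It assumes Phred+33 quality encoding and unwrapped FASTQ; converting SRA files to a FASTQ stream is left to whatever tool writes FASTQ to standard output.

```python
import io
from itertools import islice

def stream_fastq_stats(handle, phred_offset=33):
    """Summarize a FASTQ text stream in one pass without storing reads."""
    n_reads = n_bases = qual_sum = 0
    while True:
        record = list(islice(handle, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break
        qual = record[3].rstrip("\n")
        n_reads += 1
        n_bases += len(qual)
        qual_sum += sum(ord(c) - phred_offset for c in qual)
    mean_q = qual_sum / n_bases if n_bases else float("nan")
    return {"reads": n_reads, "bases": n_bases, "mean_quality": mean_q}

# Tiny in-memory example; in practice `handle` would be sys.stdin.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
summary = stream_fastq_stats(demo)
```

Memory use stays constant regardless of file size, which is what matters when churning through thousands of runs.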


We should select data of known provenance from SRA. Biostar users could nominate their own datasets since they presumably know them well and are confident about their quality/utility. Others can cross-check and approve.


An important aspect of such a study would be to generate representative samples, not only the ones that are believed to be good. Samples also need to cover a wide range of technologies and species.
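
One simple way to get there could be stratified random sampling: group runs by technology, species or library type, then draw a fixed number from each group. A minimal sketch; the record layout and the grouping key are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(runs, key, per_group, seed=0):
    """Draw up to `per_group` runs at random from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the survey reproducible
    groups = defaultdict(list)
    for run in runs:
        groups[key(run)].append(run)
    sample = []
    for group_key in sorted(groups):
        members = groups[group_key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample
```

For example, `stratified_sample(runs, key=lambda r: (r["strategy"], r["species"]), per_group=100)` would take at most 100 runs per strategy and species, so rare combinations are not drowned out by the common ones.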


+1 from me. One thing I like about FastQC is that it works with minimal input: just give it a fastq file and that's it. This comes at the cost of simplistic output, of course. Still, I think it's important for a QC tool to require minimal configuration, or at least to have default settings where all you need is the input data you want to QC. This way the tool is easy and useful also when you want a quick and dirty assessment of your data.

4.5 years ago
John 13k

It sounds like you're thinking along the same lines as SeQC: https://github.com/JohnLonginotto/seqc

Going through your list, SeQC is:

  • collaborative by design, in that SeQC servers are designed to be networked together to share stat data.
  • all stats are pre-computed. It's a stat database after all, raw data isn't saved. You can still query the database in interesting ways via SQL though.
  • Instead of defining what all the stats are up front, stats are community-generated, via modules. A module is typically 10 lines of Python code, and calculates 1 thing from a read. Stats can depend on other stats, and the dependency tree is figured out such that if 10 stats need the DNA sequence, the DNA sequence isn't decoded from the BAM 10 times, just once, and reused.
  • The result is anyone can throw up a new stat module to answer a question. I've been making little modules to solve Biostar users' questions for years, although they're usually so specific to the user's problem that they don't get much reused elsewhere.
  • The exchange format is SQLite and Postgres. All the SQL is polyglot, except for JSON stuff which only Postgres handles well at the moment. I'm told SQLite will do so soon. SQLite follows Postgres very closely, in fact "how does Postgres do it?" is their motto, so writing code to support both isn't actually very hard. Thus, users can crank out a SQLite database quickly, or can load it into an existing PG database for their production environment, making use of the multi-core + caching goodness of Postgres.
  • Yup it's distributed, as in you make your stat server, and if you set it from private to public your server joins the network of publicly accessible SeQC servers.
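
To give a feel for the module idea, here is a toy sketch of stats that depend on other stats, with each intermediate value computed once per read and reused. This is not SeQC's actual API; the `module` decorator, the registry and the read layout are invented for illustration, and there is no cycle detection.

```python
MODULES = {}  # stat name -> (names it depends on, function)

def module(name, needs=()):
    """Register a stat function together with the stats it depends on."""
    def register(func):
        MODULES[name] = (tuple(needs), func)
        return func
    return register

@module("sequence")
def sequence(read):
    return read["seq"].upper()  # stand-in for an expensive decode step

@module("length", needs=["sequence"])
def length(read, sequence):
    return len(sequence)

@module("gc_fraction", needs=["sequence", "length"])
def gc_fraction(read, sequence, length):
    return (sequence.count("G") + sequence.count("C")) / length if length else 0.0

def compute(read, wanted):
    """Compute the wanted stats for one read, evaluating each dependency once."""
    cache = {}
    def resolve(name):
        if name not in cache:
            needs, func = MODULES[name]
            cache[name] = func(read, *(resolve(n) for n in needs))
        return cache[name]
    return {name: resolve(name) for name in wanted}
```

Calling `compute({"seq": "acgg"}, ["gc_fraction", "length"])` decodes the sequence a single time even though both stats need it.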

SeQC is currently a bit of a mess, as I was in the middle of working on it when I got pulled off to work on something else. I'm scheduled to start working on it again in about a week though, and I hope the stat network will be up again by the end of May. I appreciate that you are talking about collecting a lot of stat data and analysing it in a distributed way to make empirical quality scores, and not about the general problem of sharing/distributing summary statistics, but I feel it would be a missed opportunity not to combine the two ideas since they complement each other well.

All in all, if you can think of the stats you'd need to collect, I can make the SeQC modules to calculate those stats, and then that's pretty much it done. Just get a bunch of people to install it, churn through their data, and flip the switch from private to public. The result is a bunch of public-facing but secured SQL servers (secured behind a REST API with IP query cooldowns, etc), that will perform SQL queries on the stats you made and deliver the results via JSON.


This is great! And you got the technology already implemented. I think SeQC could be a very important contribution and should be publishable easily. I think I can help you with running the survey to generate the summary stats of existing data and test some stuff on our servers, if you like.

One important aspect for a survey phase for me is to get representative samples from the archives and run the summary directly on the SRA files.


That would be great Michael! OK so I'll get to work on tidying up the code base and producing modules for the proposed statistics.

For some stats there's more than one way to calculate it (e.g. duplication rate could be based on sequence or on mapping position, and the distinction is purely subjective), so will need input from others on how best to proceed when the time comes. Will keep you posted :)
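
A small sketch of how the two definitions can disagree on the same reads. The read records and their field names are made up; a real module would take them from the BAM.

```python
def duplication_rate(keys):
    """Fraction of reads whose key has already been seen in the data."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)

# Hypothetical reads: two identical, one differing by a single base.
reads = [
    {"seq": "ACGT", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"seq": "ACGT", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"seq": "ACGA", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"seq": "TTTT", "chrom": "chr2", "pos": 5, "strand": "-"},
]

# Sequence-based: 3 distinct sequences out of 4 reads.
by_sequence = duplication_rate(r["seq"] for r in reads)
# Position-based: 2 distinct (chrom, pos, strand) keys out of 4 reads.
by_position = duplication_rate((r["chrom"], r["pos"], r["strand"]) for r in reads)
```

Here the sequence-based rate is 0.25 and the position-based rate is 0.5, so whichever definition is chosen has to be used consistently across the whole survey for the quantiles to be comparable.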


We got a new server that I could test this on for a while; it would be great if the code could be multi-threaded or otherwise parallelized. (I'm not so good(tm) with Python though.) You can contact me by email if you want, otherwise we might clutter up the thread a bit.

4.5 years ago

Proposed QC statistics

Most of these are also in FastQC. Do all scores need to be reducible to a scalar?

  • Summarized Base Quality Score (Phred)
  • Homogeneity of quality distribution
  • Per base sequence distribution (should this be there?) or deviation from uniform distribution
  • kmer content enrichment
  • adapter content
  • duplication rate

  • ....

  • alignment rate (?)
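
On the scalar question, one possible reading of "deviation from uniform distribution" is sketched below: compute a per-position vector first and reduce it afterwards (for example with `max`) if a single number is needed. The choice of the largest absolute deviation per position is an assumption, not an agreed definition.

```python
from collections import Counter

def per_position_deviation(sequences, alphabet="ACGT"):
    """Largest absolute deviation from uniform base frequency at each read position."""
    counts = []  # one Counter per read position
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    expected = 1.0 / len(alphabet)
    deviations = []
    for column in counts:
        total = sum(column[b] for b in alphabet)  # ignores N and other symbols
        if total == 0:
            deviations.append(0.0)
            continue
        deviations.append(max(abs(column[b] / total - expected) for b in alphabet))
    return deviations

profile = per_position_deviation(["AC", "AG", "AT", "AA"])
worst = max(profile)  # one way to reduce the profile to a scalar
```

In this toy input the first position is always A (deviation 0.75) while the second is perfectly uniform (deviation 0.0).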

One question is whether people are allowed to define an experiment type or whether it is somehow inferred. If either of those is the case, then something regarding per base sequence distribution is reasonable. Otherwise, I suspect it'll be problematic.

BTW, IHEC has an assay standards committee, I wonder if they've started defining any of this already.


I think defining an experiment or library type, like RNA-seq, DNA-seq or ChIP-seq, needs to be in the parameter matrix, otherwise we will keep comparing apples and oranges. The following is an example that can be retrieved from SRA's Library annotation:

Library:
Name: 2359350942
Instrument: Illumina Genome Analyzer
Strategy: WGS
Source: GENOMIC
Selection: RANDOM
Layout: SINGLE

And another one. These two do not play in the same league when it comes to QC:

Library:
Name: Lsalmonis_LifeCycle_Pool
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: PAIRED
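
A sketch of how such an annotation block could be turned into a comparison key, assuming the plain "Field: value" layout shown above. Which fields go into the key (strategy, source, layout, instrument) is only a suggestion.

```python
def parse_library_block(text):
    """Parse 'Field: value' lines into a dict with lower-cased field names."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skips the bare 'Library:' header line
            fields[key.strip().lower()] = value.strip()
    return fields

def comparison_group(fields):
    """Key that decides which reference data sets a run is compared against."""
    return (
        fields.get("strategy"),
        fields.get("source"),
        fields.get("layout"),
        fields.get("instrument"),
    )

wgs = parse_library_block(
    "Library:\nName: 2359350942\nInstrument: Illumina Genome Analyzer\n"
    "Strategy: WGS\nSource: GENOMIC\nSelection: RANDOM\nLayout: SINGLE"
)
rnaseq = parse_library_block(
    "Library:\nName: Lsalmonis_LifeCycle_Pool\nInstrument: Illumina HiSeq 2000\n"
    "Strategy: RNA-Seq\nSource: TRANSCRIPTOMIC\nSelection: other\nLayout: PAIRED"
)
```

The two libraries above end up in different groups, so their stats would never be ranked against each other.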

Cool, in that case the per base distribution and kmer content would be useful.
