Forum: Explaining the purpose of using FastQC to a colleague
7.2 years ago
ropolocan ▴ 810

Hello Biostars community,

The other day a colleague claimed that I do not need to "waste my time" running FastQC on my data because "we already have a good idea of the quality of the data from the cycle quality reports provided by the Illumina MiSeq instrument during the sequencing run". I strongly disagree with my colleague, but I see this disagreement as an opportunity to explain why FastQC is useful. In my case, I find it very useful to see the distribution of quality scores per base position, the frequency of reads with a given average quality score, and the over-representation of k-mers. I use this information to decide how to pre-process reads before de novo assembly or reference mapping. I'm curious: how would you explain to a colleague that FastQC is worth using? What other advantages could be mentioned?
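To make this concrete, here is a minimal sketch of the kind of per-position quality summary I have in mind. It is only an illustration: it assumes Phred+33 encoding and an uncompressed file, and the name reads.fastq is hypothetical.

    # Sketch: mean Phred quality per base position from a FASTQ file.
    # Assumes Phred+33 encoding; "reads.fastq" is a hypothetical file name.
    from collections import defaultdict

    def mean_quality_per_position(fastq_path):
        totals = defaultdict(int)   # position -> sum of quality scores
        counts = defaultdict(int)   # position -> number of reads covering it
        with open(fastq_path) as handle:
            for line_number, line in enumerate(handle):
                if line_number % 4 == 3:            # every 4th line is the quality string
                    for pos, char in enumerate(line.rstrip("\n")):
                        totals[pos] += ord(char) - 33   # Phred+33 decoding
                        counts[pos] += 1
        return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}

    if __name__ == "__main__":
        for pos, mean_q in mean_quality_per_position("reads.fastq").items():
            print(f"position {pos + 1}\tmean Q = {mean_q:.1f}")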

Thanks for your time.


we already have a good idea of the quality of the data from the cycle quality reports provided by the Illumina MiSeq instrument during the sequencing run

That statement, while true from your colleague's perspective, immediately brings to mind the famous story of the blind men and the elephant.


Nice analogy, and appropriate for this case. Thanks for your answer, @genomax2.


Do you disagree because you think the Illumina quality report (I assume from SAV or BaseSpace) is not sufficient or because you think that everyone should check the QC metrics?

7.2 years ago
John 13k

Many people unfortunately see quality control as an opportunity for analysts to discard near-perfect data, rather than what it really is: an opportunity for analysts to learn more about their data. I was once denied the ability to run QC tools on my data (because of the raw data's file size), and after a few months of analysis a collaborator had to tell us that the data we had assumed came from a male sample was actually female, due to a sample mix-up.

There's no such thing as quality control in bioinformatics. Data is always of dubious quality. QC should just be renamed "looking at data", and then I think people would be more tolerant of people, er, looking at data.

In conclusion, there are no surprising and unusual benefits to using FastQC that you might not have already thought of - there are only surprising and unusual problems as a result of not using it.


In conclusion, there are no surprising and unusual benefits to using FastQC that you might not have already thought of - there are only surprising and unusual problems as a result of not using it.

I really liked this phrase, and I think it is a very good way to look at how we approach using FastQC and other QC tools. I agree that looking at data and learning more about the data is not only desirable, but necessary. Thank you very much for your answer @John!


I think people use "quality control" interchangeably with "quality assessment" (i.e., looking at the data) - and FastQC is really the latter.


Very good point. Quality assessment seems like a more appropriate term. Thanks for your answer.

7.2 years ago
igor 13k

I assume that when they said they were worried about wasting your time, what they actually meant is that they were worried about wasting their own time. Based on my experience working for a sequencing facility, the majority of the time, when a client mentions FastQC, it leads to a long and painful process. They are almost guaranteed to get a red failing mark for at least one metric for at least one sample. They will be disappointed. They will blame the sequencing facility. A long discussion will ensue about interpreting FastQC metrics. Time will be spent. Unless the sequencing is a complete failure (in which case, we wouldn't send the data in the first place), the final consensus will be to run the analysis anyway and see how the results look. Whether you think this is right or wrong, no one ever throws out their samples. Thus, all that FastQC-related discussion in the middle just wasted everyone's time.

I don't mean to imply FastQC is not a good idea. It most definitely is. However, in the real world, most of the time it only leads to extra problems for everyone involved. That is why some people would discourage it.


Hi @igor. Thank you very much for sharing your point of view. I see how potential frustration, disappointment, and blaming can arise when looking at FastQC metrics, and thanks for pointing out why some people would discourage FastQC. My colleague and I are collaborators, and we happen to be in different geographical locations. My colleague works on the sequencing and I work on the analysis of the data. Unfortunately my colleague does not have access to BaseSpace (long story), so unless I run FastQC on the data, the only reference my colleague has for the quality of a run is the report and visualizations that the MiSeq instrument itself generates. Sometimes my colleague expects me to just trust that what he sent me is good. In my case, I take FastQC's passing/failing marks with a grain of salt. I prefer to critically assess the FastQC metrics and then make decisions on how to pre-process my data, namely by trimming or masking low-quality bases and/or removing reads whose average quality is lower than a certain threshold. But yes, I can see how some people could see this as a waste of time.
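To give a concrete example of the read filtering I mentioned above, here is a rough sketch; the quality threshold and file names are arbitrary illustrative choices, and Phred+33 encoding is assumed.

    # Sketch: drop reads whose mean Phred quality is below a threshold.
    # "input.fastq", "filtered.fastq" and min_mean_q=20 are arbitrary choices.

    def filter_by_mean_quality(in_path, out_path, min_mean_q=20):
        kept = dropped = 0
        with open(in_path) as src, open(out_path, "w") as dst:
            while True:
                record = [src.readline() for _ in range(4)]   # one FASTQ record
                if not record[0]:                             # end of file
                    break
                quals = [ord(c) - 33 for c in record[3].rstrip("\n")]
                if quals and sum(quals) / len(quals) >= min_mean_q:
                    dst.writelines(record)
                    kept += 1
                else:
                    dropped += 1
        print(f"kept {kept} reads, dropped {dropped}")

    filter_by_mean_quality("input.fastq", "filtered.fastq")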

Unless the sequencing is a complete failure (in which case, we wouldn't send the data in the first place), the final consensus will be to run the analysis anyway and see how the results look. Whether you think this is right or wrong, no one ever throws out their samples. Thus, all that FastQC-related discussion in the middle just wasted everyone's time.

We often use this approach as well, and it is the sensible thing to do in many cases. I agree: it is very rare that we throw out the results of a run, unless the sequencing was terrible.


I agree that you shouldn't trust him (or anyone, really) that the data is good. Do you really need to convince him, though?

I would say the best reason to use FastQC is to be able to have comparable metrics before and after applying various filters (trimming, removing reads, etc.). Even if your colleague is doing excellent QC of the raw reads, he is not checking them at later stages. The best explanation is that you are evaluating your own filters, not double-checking his QC.
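For instance, something as simple as running FastQC on the reads before and after trimming gives you two reports you can put side by side (a sketch only; it assumes fastqc is on your PATH, and the file names are hypothetical):

    # Sketch: generate FastQC reports for raw and trimmed reads so the
    # per-base quality and adapter-content modules can be compared directly.
    # Assumes the fastqc executable is on PATH; file names are hypothetical.
    import subprocess
    from pathlib import Path

    outdir = Path("qc_reports")
    outdir.mkdir(exist_ok=True)

    for fq in ["raw_R1.fastq.gz", "trimmed_R1.fastq.gz"]:
        subprocess.run(["fastqc", fq, "-o", str(outdir)], check=True)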


You have made an excellent point, @igor. I completely agree. Having comparable metrics before and after applying different pre-processing methods is an excellent reason for using FastQC. Thank you very much for your answer.

7.2 years ago
Brian Bushnell

I think that FastQC is a nice tool for graphically displaying various aspects of a library's quality. Several times, when a person was experiencing analysis difficulties, I asked them to post the FastQC output to help diagnose the problem. Presumably, if they had used FastQC in the first place, they could have avoided some of the time spent trying, and initially failing, to do the analysis.

That said, I don't consider FastQC to be sufficient alone. Here are a few things I find useful in data QC:

1) Insert-size distribution. This may let you know, for example, why the bases in the base-frequency histogram are diverging toward the read end.

2) Synthetic/spike-in contaminant metrics. For example, PhiX, molecular weight markers, etc.

3) Organism-hit metrics. E.g., the results of BLASTing 1000 reads to nt / RefSeq, or mapping all reads to custom databases of known organismal contaminants, such as human when working on non-human genomes.

2 and 3 will help you spend far less time figuring out why only 80% of your reads map if you already know 18% of your reads are Delftia.

4) True quality metrics. Illumina quality scores are not accurate; to know the quality of the data, you need to look further - e.g., map the reads and count matches/mismatches.

5) Library-complexity metrics. This is situational.

6) Kmer-frequency histogram, GC-content histogram, or even both combined. Along with 2 and 3, this can allow you to spot contamination early and decontaminate before assembling and performing an incorrect analysis.

Because I think these things are important, I've written tools to calculate most of them. They are autogenerated by our pipelines and available as graphs when an analyst wants to look at the library, which saves a lot of time. Some are generated from a random subsample of the reads to save compute time, while others (like how much PhiX or human is present) are generated as a side-effect of removing the artifact when processing all reads.
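Purely as an illustration of point 6 (and not the pipeline tooling described above), here is a minimal sketch of a GC-content histogram computed from a random subsample of reads; the file name, subsample size, and bin count are arbitrary.

    # Sketch: GC-content histogram from a reservoir subsample of reads.
    # "reads.fastq", sample_size, and bins are arbitrary illustrative choices.
    import random
    from collections import Counter

    def gc_histogram(fastq_path, sample_size=10000, bins=20, seed=0):
        rng = random.Random(seed)
        sampled = []                      # reservoir sample of read sequences
        seq_index = 0
        with open(fastq_path) as handle:
            for i, line in enumerate(handle):
                if i % 4 == 1:            # sequence line of each FASTQ record
                    seq = line.strip()
                    if len(sampled) < sample_size:
                        sampled.append(seq)
                    else:
                        j = rng.randint(0, seq_index)
                        if j < sample_size:
                            sampled[j] = seq
                    seq_index += 1
        hist = Counter()
        for seq in sampled:
            if seq:
                gc = (seq.count("G") + seq.count("C")) / len(seq)
                hist[min(int(gc * bins), bins - 1)] += 1
        return {f"{100 * b / bins:.0f}-{100 * (b + 1) / bins:.0f}% GC": hist[b]
                for b in sorted(hist)}

    print(gc_histogram("reads.fastq"))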


@Brian raised most of the points that I was going to make - including the use of his tools in addition to FastQC :-). SAV only indicates whether the data quality is good and whether the sample contains in-register low-complexity sequence such as adapter dimers. It does not detect contamination, short reads that contain adapter tails (at least until the read runs off the end of the adapter), or modest poly(A) contamination in RNA-Seq libraries, which causes some aligners to spin their wheels for days.

@igor raises a valid point about naïve users freaking out when FastQC invariably reports k-mer content failures (I don't know if I've ever had a library pass that metric), but the issue is easily addressed by forewarning users that it's going to happen.


Thank you very much for your answer, @Brian Bushnell. I agree, FastQC is not sufficient on its own. Thank you for the very useful tips you have provided. Are the tools you are referring to part of BBTools?


Yes, they are, mostly. To be precise, for Illumina reads, JGI uses:

1) BBDuk for adapter-removal.

2) BBDuk (in a second pass) for synthetic contaminant and PhiX removal, and quality-trimming and filtering.

3) BBMap for removing common microbial contaminants (depends on the pipeline; you may not want to do this for metagenomes, or when intentionally sequencing E. coli).

4) BBMap for removing human, cat, dog, and mouse reads, which are common contaminants (the specific settings are only recommended for non-vertebrate sequencing).

5) BBMerge for insert-size distribution calculation.

6) On the filtered and trimmed reads, subsampled, BLAST versus nt and several other databases for determining contamination levels, and whether the sample is even the correct organism.

JGI does not actually use FastQC, partly because we have proprietary tools that replicate much of its functionality, partly due to NIH syndrome, and partly because it's not perfect. That said, our tools don't replicate all of its functionality, and I think FastQC is quite useful. I will look into the release process for JGI's QC tools, because I really think they would also be useful to the community as a supplement to existing tools like FastQC. BBTools are already available, since I spent a lot of time positively interacting with Berkeley's legal department, but the others are not.
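Purely to illustrate the first two steps (this is not the actual JGI pipeline; the file names are hypothetical and the parameters are commonly used example values, so check them against the BBDuk documentation for your own data):

    # Sketch: adapter trimming, then PhiX removal and quality trimming, with
    # BBDuk called via subprocess. Assumes bbduk.sh is on PATH and that
    # adapters.fa / phix.fa are available; all file names are hypothetical.
    import subprocess

    def run(cmd):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    # Pass 1: right-end kmer trimming against the adapter reference.
    run([
        "bbduk.sh",
        "in=raw_R1.fastq.gz", "in2=raw_R2.fastq.gz",
        "out=trimmed_R1.fastq.gz", "out2=trimmed_R2.fastq.gz",
        "ref=adapters.fa",
        "ktrim=r", "k=23", "mink=11", "hdist=1", "tpe", "tbo",
    ])

    # Pass 2: PhiX/synthetic-contaminant removal plus quality trimming/filtering.
    run([
        "bbduk.sh",
        "in=trimmed_R1.fastq.gz", "in2=trimmed_R2.fastq.gz",
        "out=clean_R1.fastq.gz", "out2=clean_R2.fastq.gz",
        "ref=phix.fa",
        "k=31", "hdist=1",
        "qtrim=rl", "trimq=10",
    ])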
