Question: FastQ quality check: what can we correct?
6.3 years ago, Rad wrote:

Hello there,

I have a lot of MiSeq data that I am trying to analyze. FastQC reports a lot of fails for it, and I wanted to pick your brains to get a sense of what should be done in that case.

As you all know, FastQC generates this kind of report:

[PASS] Basic Statistics
[FAIL] Per base sequence quality
[PASS] Per sequence quality scores
[FAIL] Per base sequence content
[FAIL] Per base GC content
[WARNING] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[FAIL] Sequence Duplication Levels
[WARNING] Overrepresented sequences
[FAIL] Kmer Content


The issue here is that I am analyzing targeted sequencing data, so I expect a lot of duplication. What I don't clearly see is whether to take the FastQC result at face value, judged against the standard published on their website (what a good report and a bad report should look like). I also expect the GC content to go crazy given the amount of duplication in this type of experiment (deep sequencing). Based on the report above, do you think FASTQ post-processing such as clipping and trimming would correct the reads, or is the problem already at the level of the MiSeq machine (experimental contamination?)


6.3 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

I would use the FastQC results directly (the plots) and comparatively (between lanes/runs of the same type of experiment) and not rely heavily on the PASS/FAIL stuff.  In general, trimming can often help things, but for targeted sequencing, the aligners do a fairly good job of soft clipping adapters and low-quality bases anyway.  The better metric, in my opinion, is to do alignment and see what you get.  A perfectly good FastQC result can lead to very poor mapping (sequenced wrong organism, for example).  Also, a pretty poor FastQC result often results in usable data.  Where FastQC can be quite helpful is in determining whether sequencing the same library will be an effective way of increasing depth.  If the FastQC results suggest that there are library issues, perhaps making a new library is warranted.  
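To make the "comparative" reading concrete, here is a minimal sketch that parses summaries pasted in the bracketed form shown in the question and lists the modules whose status differs between runs. The function names and the two example runs are hypothetical, made up for illustration; this is not part of FastQC itself.

```python
import re
from collections import defaultdict

# Matches lines such as "[FAIL] Per base sequence quality"
LINE_RE = re.compile(r"^\[(PASS|WARNING|FAIL)\]\s+(.+?)\s*$")


def parse_summary(text):
    """Return {module name: status} for one pasted FastQC summary."""
    statuses = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            statuses[match.group(2)] = match.group(1)
    return statuses


def modules_that_differ(summaries):
    """Given {sample: {module: status}}, keep modules whose status varies."""
    by_module = defaultdict(dict)
    for sample, statuses in summaries.items():
        for module, status in statuses.items():
            by_module[module][sample] = status
    return {
        module: per_sample
        for module, per_sample in by_module.items()
        if len(set(per_sample.values())) > 1
    }


# Hypothetical example: two runs of the same type of experiment
run_a = "[PASS] Basic Statistics\n[FAIL] Sequence Duplication Levels\n"
run_b = "[PASS] Basic Statistics\n[WARNING] Sequence Duplication Levels\n"
summaries = {"run_a": parse_summary(run_a), "run_b": parse_summary(run_b)}
differing = modules_that_differ(summaries)
print(differing)
```

A module that fails in every run of a targeted experiment (duplication, for instance) is probably a property of the assay; a module that changes status between otherwise comparable runs is the one worth opening the plots for.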


Thanks Sean. What I do have is a per base sequence quality that drops below 28 around positions 100-120 bp of 200 bp reads, so the question this raises is whether or not it is a primer problem. Besides, clipping the sequences to 50% of their length looks a bit brutal to me, don't you think?

Reply written 6.3 years ago by Rad

I have seen only one run in 6 years that required clipping based on the base pair location.  If you want to clip, do so based on quality, not base pair.  Also, keep in mind that those plots need to be read carefully, as even pretty bad plots typically have a large proportion of the reads with perfectly acceptable quality scores.
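As a rough illustration of the difference between clipping by quality and clipping by position, here is a minimal sketch. It assumes Phred+33 encoded quality strings and uses the simplest possible rule (drop trailing bases below a threshold), which is not the algorithm of any particular trimmer. The function names and the thresholds of 20 and 28 are illustrative assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose quality is below min_q.

    Each read is cut where its own quality drops, instead of at a
    fixed position shared by all reads.
    """
    scores = phred_scores(qual, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def fraction_passing(quality_strings, min_mean=28, offset=33):
    """Fraction of reads whose mean quality is at least min_mean."""
    if not quality_strings:
        return 0.0
    passing = 0
    for qual in quality_strings:
        scores = phred_scores(qual, offset)
        if scores and sum(scores) / len(scores) >= min_mean:
            passing += 1
    return passing / len(quality_strings)


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33
trimmed = trim_3prime("ACGTACGT", "IIIII###")
share = fraction_passing(["IIIIIIII", "IIIII###"])
print(trimmed, share)
```

The second function is a quick way to check the point about plots: a per-base plot can look bad while a large share of individual reads still has an acceptable mean quality.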

Reply written 6.3 years ago by Sean Davis

I see.

Thanks for the tips. I am going to use the data, but will play with the quality threshold at the alignment level.

Reply written 6.3 years ago by Rad
6.3 years ago, Dan D wrote:

I consider FastQC as a screening program for rapidly getting QC info; I don't see it as a definitive QC analysis tool, and it's not designed to be that. It pulls a subset (10% IIRC) of the reads in the file and works with those. It's designed to be very picky and give false positives. With a yellow or red flag, FastQC is basically saying "Yo! Look at this! Might be a problem in a general context." If you see that the result is in fact what you expect given the experimental context, then it's probably safe to proceed.


Indeed, +1 for the 'Yo!' :) What worries me, though, is that all the samples I am preparing to analyze show a strong GC bias and a per base quality drop around the middle of the reads. That looks like more than just a warning, even though it is a deep sequencing experiment and even though the report only represents a 10% subset.

Reply written 6.3 years ago by Rad
6.3 years ago, rtliu (New Zealand) wrote:

Judging from your 11 reporting items, you are not using the latest version of FastQC (0.11.2). You are missing two newer reporting items:

  • [Per tile sequence quality]
  • [Adapter Content]

Even with a green tick, the [Adapter Content] graph gave me a clear indication that I needed to trim the adapters on my recent HiSeq 2000 data.

After trimming with, the read trail disappeared
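For readers who want to see what an adapter check and a 3' adapter trim amount to, here is a minimal sketch. It is not the tool used above. It assumes exact matching only, and the default adapter is the start of the Illumina universal adapter sequence, which is an assumption that may not match your library. The function names are hypothetical.

```python
# Start of the Illumina universal adapter (an assumption; check your kit)
ILLUMINA_UNIVERSAL = "AGATCGGAAGAGC"


def adapter_fraction(reads, adapter=ILLUMINA_UNIVERSAL):
    """Fraction of reads that contain the adapter as an exact substring."""
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if adapter in read)
    return hits / len(reads)


def trim_adapter(read, adapter=ILLUMINA_UNIVERSAL):
    """Cut the read at the first exact adapter match, if there is one."""
    index = read.find(adapter)
    return read if index == -1 else read[:index]


# Hypothetical reads: the first one runs into the adapter after 4 bases
reads = ["ACGTAGATCGGAAGAGCTT", "ACGTACGT"]
fraction = adapter_fraction(reads)
cleaned = [trim_adapter(read) for read in reads]
print(fraction, cleaned)
```

Real trimmers also handle mismatches and partial adapters at the very end of a read, which exact matching misses, so treat this only as a way to eyeball how widespread the adapter is.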









Currently the newest version is 0.11.5, although it adds no new metrics on top of yours.

Reply written 3.9 years ago by Ömer An
6.3 years ago, Rad wrote:

Here is an update with some plots. I am a bit confused, since I now see this trend quite often in data coming from the sequencing facility, and I wonder if there is an experimental problem that should be reported. This is a MiSeq analysis of a single cell.

Note how the read quality quickly drops in the middle of the reads and how the GC content distribution is shifting. I am also annoyed by the sequence length distribution: there seem to be a lot of short sequences!

Any comments?
