Question

FastQ quality check: what can we correct?

8

9.8 years ago

Rad ▴ 810

Hello there,

I have a lot of MiSeq data to analyze, and FastQC reports many failures on it. I would like to pick your brains to get a sense of what should be done in this case.

As you all know, FastQC generates this kind of information:

[PASS] Basic Statistics
[FAIL] Per base sequence quality
[PASS] Per sequence quality scores
[FAIL] Per base sequence content
[FAIL] Per base GC content
[WARNING] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[FAIL] Sequence Duplication Levels
[WARNING] Overrepresented sequences
[FAIL] Kmer Content

The issue here is that I am analyzing targeted sequencing data, so I expect a lot of duplication. What I don't clearly see is whether to take the FastQC result at face value, based on the standards published on their website (what a good report and a bad report should look like). I also expect the GC content to go crazy with this amount of duplication because of the type of experiment (deep sequencing). Based on the example above, do you think FASTQ post-processing such as clipping and trimming would correct the reads, or is the failure already happening at the level of the MiSeq machine (experimental contamination)?

Rad

qc fastq • 15k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Rad ▴ 810

2

9.8 years ago

rtliu ★ 2.2k

Judging from the 11 items in your report, you are not using the latest version of FastQC (0.11.2). Two more report items are missing:

[Per tile sequence quality]
[Adapter Content]

Even with a green tick, the [Adapter Content] graph gave me a clear indication that I needed to trim the adapters on my recent HiSeq 2000 data.

After trimming with Trimmomatic, the read trail disappeared.
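As a quick sanity check alongside the [Adapter Content] plot, one could count how many reads contain an adapter sequence before and after trimming. This is only a minimal sketch: the `ADAPTER` value is an assumption (the start of the Illumina universal adapter), `adapter_fraction` is a hypothetical helper, and the reads are toy data.

```python
# Assumption: the 12-base start of the Illumina universal adapter; swap in
# the adapter that matches your own library prep.
ADAPTER = "AGATCGGAAGAG"


def adapter_fraction(reads, adapter=ADAPTER):
    """Return the fraction of reads that contain the adapter sequence."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for seq in reads if adapter in seq)
    return hits / len(reads)


# Hypothetical toy reads: the second and fourth run into the adapter.
example_reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTAGATCGGAAGAGCACA",
    "TTTTGGGGCCCCAAAATTTT",
    "GGGGAGATCGGAAGAGTTTT",
]
print(adapter_fraction(example_reads))
```

An exact substring match like this misses adapters with sequencing errors or partial adapters at the very end of a read, so it is a rough indicator, not a replacement for a real trimmer.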

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by rtliu ★ 2.2k

0

Currently the newest version is 0.11.5 although there are no new metrics on top of yours.

ADD REPLY • link 7.4 years ago by Ömer An ▴ 260

0

9.8 years ago

Rad ▴ 810

Here is an update with some plots. I am a bit confused, since I now see this trend quite often in data coming from the sequencing facility, and I wonder whether there is an experimental problem that should be reported. This is a MiSeq analysis of a single cell.

Note how the read quality quickly drops in the middle of the reads and how the GC content distribution is shifting. I am also annoyed by the sequence length distribution: there seem to be a lot of short sequences!

Any comments?

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Rad ▴ 810

Ram · Accepted Answer · 2014-07-10

7

9.8 years ago

Sean Davis 26k

I would use the FastQC results directly (the plots) and comparatively (between lanes/runs of the same type of experiment) and not rely heavily on the PASS/FAIL stuff. In general, trimming can often help things, but for targeted sequencing, the aligners do a fairly good job of soft clipping adapters and low-quality bases anyway. The better metric, in my opinion, is to do alignment and see what you get. A perfectly good FastQC result can lead to very poor mapping (sequenced wrong organism, for example). Also, a pretty poor FastQC result often results in usable data. Where FastQC can be quite helpful is in determining whether sequencing the same library will be an effective way of increasing depth. If the FastQC results suggest that there are library issues, perhaps making a new library is warranted.
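To make the "comparatively" part concrete, here is a minimal sketch that parses summaries in the `[STATUS] Module` layout shown in the question and lists the modules whose status differs between two runs. The function names and both example summaries are hypothetical; it only complements looking at the plots themselves.

```python
import re

# Matches lines in the "[STATUS] Module name" layout shown in the question.
LINE_RE = re.compile(r"^\[(PASS|WARNING|FAIL)\]\s+(.+?)\s*$")


def parse_summary(text):
    """Map module name -> status for one report summary."""
    statuses = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            statuses[match.group(2)] = match.group(1)
    return statuses


def diff_runs(run_a, run_b):
    """Return modules whose status differs between two runs, sorted by name."""
    a, b = parse_summary(run_a), parse_summary(run_b)
    return [
        (module, a.get(module, "MISSING"), b.get(module, "MISSING"))
        for module in sorted(set(a) | set(b))
        if a.get(module) != b.get(module)
    ]


# Hypothetical summaries for two runs of the same kind of experiment.
run_1 = """[PASS] Basic Statistics
[FAIL] Per base sequence quality
[FAIL] Sequence Duplication Levels"""
run_2 = """[PASS] Basic Statistics
[PASS] Per base sequence quality
[FAIL] Sequence Duplication Levels"""

for module, before, after in diff_runs(run_1, run_2):
    print(f"{module}: {before} -> {after}")
```

A module that fails in every run of the same experiment type (duplication in targeted sequencing, for example) is less informative than one that changes from run to run.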

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Sean Davis 26k

0

Thanks Sean. What I do have is a per base sequence quality dropping below 28 around positions 100-120 of 200 bp reads, so the question this raises is whether or not it is a primer problem. Besides, clipping the sequences to 50% of their length looks a bit brutal to me, don't you think?
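For reference, the position where the mean quality first falls below 28 can be located with a few lines of code. This is a rough sketch assuming Phred+33 encoded quality strings; the helper names and the toy quality lines are hypothetical, and FastQC itself reports binned boxplots, not a plain mean.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality_per_position(quality_lines):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_position_below(means, threshold=28):
    """1-based position where the mean first drops below threshold, else None."""
    for pos, mean in enumerate(means, start=1):
        if mean < threshold:
            return pos
    return None


# Hypothetical quality strings: 'I' is Q40 and '5' is Q20 in Phred+33.
quals = ["IIII55", "IIII55", "III555"]
means = mean_quality_per_position(quals)
print(first_position_below(means))
```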

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Rad ▴ 810

0

I have seen only one run in 6 years that required clipping based on the base pair location. If you want to clip, do so based on quality, not base pair. Also, keep in mind that those plots need to be read carefully, as even pretty bad plots typically have a large proportion of the reads with perfectly acceptable quality scores.
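Clipping on quality instead of on a fixed position could look roughly like the sliding-window sketch below. It is similar in spirit to what quality trimmers do, but it is not the implementation of any particular tool. The window size of 4, the minimum mean of 20, and the Phred+33 encoding are all assumptions chosen for illustration, and `trim_by_quality` is a hypothetical helper.

```python
def trim_by_quality(seq, qual, window=4, min_mean=20, offset=33):
    """Cut the read at the first window whose mean Phred score is below min_mean.

    Assumes Phred+33 qualities. Reads shorter than the window are returned
    unchanged.
    """
    scores = [ord(ch) - offset for ch in qual]
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_mean:
            return seq[:start], qual[:start]
    return seq, qual


# Hypothetical read: six Q40 bases ('I') followed by four Q2 bases ('#').
print(trim_by_quality("ACGTACGTAC", "IIIIII####"))
```

Each read is cut where its own quality drops, so good reads keep their full length, whereas a fixed position cut would shorten every read in the file.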

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Sean Davis 26k

0

I see,

Thanks for the tips. I am going to use the data, but will play with the quality threshold at the alignment level.

Thanks
Rad

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Rad ▴ 810

score 5 · Accepted Answer · 2014-07-10

5

9.8 years ago

Dan D 7.4k

I consider FastQC as a screening program for rapidly getting QC info; I don't see it as a definitive QC analysis tool, and it's not designed to be that. It pulls a subset (10% IIRC) of the reads in the file and works with those. It's designed to be very picky and give false positives. With a yellow or red flag, FastQC is basically saying "Yo! Look at this! Might be a problem in a general context." If you see that the result is in fact what you expect given the experimental context, then it's probably safe to proceed.

ADD COMMENT • link 9.8 years ago by Dan D 7.4k

0

Indeed, +1 for the 'Yo!' :) What worries me, though, is that all the samples I am preparing to analyze show a strong GC bias and a quality drop around the middle of the reads, which is more than just a warning, even though it is a deep sequencing experiment and even though it is only a representation of a 10% subset.

ADD REPLY • link 9.8 years ago by Rad ▴ 810