How to repair *all* problems identified by FastQC?
6.4 years ago
dec986 ▴ 370

Hello,

I am downloading public data, and am running FastQC on a number of FASTQ files I've downloaded. I get reports like this:

PASS    Basic Statistics    SRR2637682_1.fastq.bz2
PASS    Per base sequence quality   SRR2637682_1.fastq.bz2
PASS    Per tile sequence quality   SRR2637682_1.fastq.bz2
PASS    Per sequence quality scores SRR2637682_1.fastq.bz2
FAIL    Per base sequence content   SRR2637682_1.fastq.bz2
FAIL    Per sequence GC content SRR2637682_1.fastq.bz2
PASS    Per base N content  SRR2637682_1.fastq.bz2
PASS    Sequence Length Distribution    SRR2637682_1.fastq.bz2
FAIL    Sequence Duplication Levels SRR2637682_1.fastq.bz2
WARN    Overrepresented sequences   SRR2637682_1.fastq.bz2
PASS    Adapter Content SRR2637682_1.fastq.bz2
FAIL    Kmer Content    SRR2637682_1.fastq.bz2

I've read about lots of quality control tools that can fix some of these problems. However, I cannot find one that works properly and generates a "PASS" for all of these.

For example, I have absolutely no idea how to fix the "Kmer Content" module; all I know is that it has shown a FAIL in every real example I've seen.

All I can find are trimmers and adapter removers, which don't improve most of the modules here. I have no idea how to fix "Per base sequence content", for example; all I know is that it always shows FAIL.

FastQC doesn't actually fix anything, so how can I go about fixing all of these modules? Are there some that are okay to fail?

RNA-Seq FastQC • 6.8k views

Some "problems" are not problems. For example:

  • FastQC will flag FAIL for most RNA-seq libraries, because its pass/fail thresholds assume a genomic library.
  • Illumina TruSeq RNA-seq libraries will always flag FAIL for "Per base sequence content".

You have to take FastQC warnings and fails with a grain of salt, taking into account the nature of the samples being analysed.
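To make that "grain of salt" concrete, here is a rough sketch that reads a FastQC summary (tab-separated status, module, file name) and separates flags you might expect for an RNA-seq library from the ones worth a closer look. The `EXPECTED_FOR_RNASEQ` set and the `triage_summary` name are my own assumptions for illustration, not anything FastQC provides; adjust the set to your library type.

```python
# Modules assumed to flag routinely on RNA-seq data. This set is an
# illustrative assumption based on this thread, not an official list.
EXPECTED_FOR_RNASEQ = {
    "Per base sequence content",
    "Sequence Duplication Levels",
    "Kmer Content",
}


def triage_summary(summary_text, expected=EXPECTED_FOR_RNASEQ):
    """Split non-PASS modules into (needs review, expected for this library)."""
    review, expected_flags = [], []
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        # Each summary line is: STATUS <tab> module name <tab> file name
        status, module, _filename = line.split("\t", 2)
        if status == "PASS":
            continue
        target = expected_flags if module in expected else review
        target.append((status, module))
    return review, expected_flags


# A few lines from the report in the question, with tabs restored.
example = "\n".join([
    "PASS\tBasic Statistics\tSRR2637682_1.fastq.bz2",
    "FAIL\tPer base sequence content\tSRR2637682_1.fastq.bz2",
    "FAIL\tPer sequence GC content\tSRR2637682_1.fastq.bz2",
    "FAIL\tSequence Duplication Levels\tSRR2637682_1.fastq.bz2",
    "WARN\tOverrepresented sequences\tSRR2637682_1.fastq.bz2",
    "FAIL\tKmer Content\tSRR2637682_1.fastq.bz2",
])

review, expected_flags = triage_summary(example)
for status, module in review:
    print(f"look closer: {status} {module}")
```

With this set, only "Per sequence GC content" and "Overrepresented sequences" are left for manual review; whether those matter still depends on the sample.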

P.S.: added link for post discussing TruSeq hexamer priming problem.


Nextera genomic libraries also fail the "per base sequence content", at least they did a few years back.

I believe that was because of some residual transposase bias in the first 10-15 bp.
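If you want to see that kind of positional bias yourself, a small sketch like the one below computes the per-position base fractions that the "Per base sequence content" plot is built from. The function names and toy reads are hypothetical, and the 20% default is my understanding of the A/T and G/C difference at which FastQC reports a failure, so treat it as an assumption.

```python
from collections import Counter


def per_base_content(reads):
    """Return, for each read position, the fraction of A, C, G and T."""
    length = max(len(read) for read in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    fractions = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")  # ignore N and other symbols
        fractions.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return fractions


def biased_positions(fractions, threshold=0.2):
    """Positions where |A - T| or |G - C| exceeds the threshold (assumed 20%)."""
    return [
        pos
        for pos, f in enumerate(fractions)
        if abs(f["A"] - f["T"]) > threshold or abs(f["G"] - f["C"]) > threshold
    ]


# Toy reads: every read starts with G, later positions are balanced.
reads = ["GAC", "GCA", "GGT", "GTG"]
fractions = per_base_content(reads)
print(biased_positions(fractions))
```

On real data you would feed in the sequence lines of a FASTQ file; a bias confined to the first 10-15 positions would point to priming or transposase effects rather than bad sequencing.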

6.4 years ago
novice ★ 1.1k

Easy: You download the tool FixReadsForGood.pl and select option --no-more-worries.

Just kidding!

Yes, there are usually some warnings that you can ignore. Quality control is entirely based on your knowledge of the sequences and your purposes. In my opinion, people more often than not unnecessarily filter/trim and lose information.

6.4 years ago
Ian 6.0k

A good way to address the flagged modules (taking into account what the others said about their relevance) is to run the reads through a trimming tool such as Trimmomatic or cutadapt. This removes adapters as well as poor-quality reads and bases. Rerunning FastQC afterwards will often show a vast improvement.

Also, take a read of the excellent QCfail.
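For anyone wondering what the quality-trimming step actually does to a read, here is a minimal sketch of a running-sum 3' trimming rule of the kind used by BWA-style trimmers. It is not the Trimmomatic or cutadapt code; the function name, the cutoff of 20 and the Phred+33 offset are assumptions for illustration.

```python
def trim_3prime(seq, qual, cutoff=20, offset=33):
    """Trim the low-quality 3' end of one read.

    Walk backwards from the 3' end, summing (cutoff - quality). The read is
    cut where that running sum peaks, and the walk stops once it turns negative.
    """
    scores = [ord(ch) - offset for ch in qual]  # assumes Phred+33 encoding
    running = 0
    best = 0
    cut = len(seq)  # default: keep the whole read
    for i in range(len(scores) - 1, -1, -1):
        running += cutoff - scores[i]
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return seq[:cut], qual[:cut]


# 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
print(trim_3prime("ACGTACGT", "IIIII###"))
print(trim_3prime("ACGT", "IIII"))
```

The first read loses its three low-quality bases and the second is returned unchanged. Note that trimming like this only acts on base qualities; it does nothing about composition-based modules such as "Per base sequence content".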
