Question

Several failures present in QC of RNA-seq data from company via fastqc

4.1 years ago

citronxu ▴ 20

Hello everyone,

A few days ago I received RNA-seq data from an Illumina paired-end library with 150 bp reads and ran quality control with FastQC (version 0.11.5). Although the sequencing quality is really good, several modules fail, mainly three: 'Per base sequence content', 'Sequence Duplication Levels' and 'Kmer Content'. (Details of the failures are shown in the pictures below.)

Then I noticed that the failures have one thing in common: the problem is always in the first nine to ten nt, where the results drop dramatically, while the results at the following positions are quite good. So I wondered whether the raw data received from the company had not yet gone through adapter trimming. I then ran Trimmomatic to see what would happen to the reads. Surprisingly, when I ran FastQC on its output, the results changed noticeably: some of my reads had been clipped by 3 nt, ending up 147 bp long.

Here is my question: how can some modules fail in my datasets when the quality is this good? And does the trimming report show that the original data from the company had already been trimmed and can be used for downstream analysis?

[Images: FastQC results for the original data (basic statistics, per base quality, per base sequence content, duplication levels, Kmer content); FastQC basic statistics for the trimmed data; the trimming command used; the trimming output]

RNA-Seq • 1.0k views

Thanks a lot!!! That perfectly answered my question.

Reply by citronxu, 4.1 years ago

Do note the error message: you are not actually trimming the adapters.

Trimmomatic will chug along even when the adapter file is missing and then cheerfully report "completed successfully", when in fact there was a major error: the adapter file was not found.
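
One way to guard against the missing adapter file described above is to check for it before Trimmomatic is ever launched. The following is only a minimal sketch: the helper name `illuminaclip_step` is hypothetical, and the default values 2:30:10 are the ones commonly shown in Trimmomatic examples, assumed here rather than taken from the original command.

```python
from pathlib import Path


def illuminaclip_step(adapter_fasta, seed_mismatches=2,
                      palindrome_threshold=30, simple_threshold=10):
    """Build an ILLUMINACLIP step string, failing loudly if the adapter file is absent.

    The numeric defaults are assumptions (typical example values), not
    settings taken from the original poster's command.
    """
    path = Path(adapter_fasta)
    if not path.is_file():
        # Stop here instead of letting the trimming run "succeed" without adapters.
        raise FileNotFoundError(f"adapter file not found: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"adapter file is empty: {path}")
    return (f"ILLUMINACLIP:{path}:{seed_mismatches}:"
            f"{palindrome_threshold}:{simple_threshold}")
```

Calling `illuminaclip_step("TruSeq3-PE.fa")` (a file name used purely as an illustration) either returns the step string to append to the Trimmomatic command line or raises before any trimming starts.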

In general I would recommend running the SLIDINGWINDOW operation.

It really makes no sense to do TRAILING:3, since a quality of 3 is just as bad as a quality of 4, 5 or 6. You should use a sliding window instead, say an average quality of 30 over 4-5 bases, or something similar.
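
To make the difference concrete, here is a simplified Python illustration of the two trimming ideas. It is a sketch of the concept only, not Trimmomatic's actual implementation (which differs in detail); the function names and the example quality values are made up for illustration.

```python
def phred33(quality_string):
    """Convert a Phred+33 encoded FASTQ quality string to integer scores."""
    return [ord(c) - 33 for c in quality_string]


def trailing_cut(quals, threshold=3):
    """Number of bases kept after dropping 3'-end bases with quality below threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return end


def sliding_window_cut(quals, window=4, min_avg=30):
    """Number of bases kept when cutting at the first window whose mean quality is below min_avg."""
    if not quals:
        return 0
    # Reads shorter than the window are evaluated as a single short window.
    for start in range(0, max(len(quals) - window + 1, 1)):
        chunk = quals[start:start + window]
        if sum(chunk) / len(chunk) < min_avg:
            return start
    return len(quals)


# Hypothetical read: ten good bases followed by a poor tail (Q4 to Q12).
quals = [38] * 10 + [12, 8, 5, 4, 6, 5]
print(trailing_cut(quals, threshold=3))
print(sliding_window_cut(quals, window=4, min_avg=30))
```

For this made-up read, the trailing cut at quality 3 keeps all 16 bases because no base in the tail falls below 3, while the sliding window keeps only the first 8. In Trimmomatic the corresponding step would be written as `SLIDINGWINDOW:4:30`.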

Reply by Istvan Albert, 4.1 years ago

score 2 · Accepted Answer · 2020-03-23

Please read the following blog posts from the authors of FastQC:

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/

FastQC rarely reports true failures. A red X means that your data falls outside the thresholds FastQC uses. These thresholds are defined in a file and can be changed, and the default values are set for genomic sequence. A red X does not automatically mean that your data is bad. You have to consider the experimental context, and in most cases you will be able to go forward with the rest of the analysis.
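
As a rough illustration of what "outside the thresholds" means for the 'Per base sequence content' module, the sketch below computes, for each read position, the larger of the A/T and G/C percentage differences and labels it against two cutoffs. This is not FastQC's code. The function name is hypothetical, and the cutoffs of 10 and 20 percentage points follow what the FastQC documentation describes for this module; treat them as assumptions and check the limits file shipped with your FastQC version.

```python
from collections import Counter


def per_base_content_flags(reads, warn=10.0, fail=20.0):
    """Return (position, max base-percentage difference, label) for each read position.

    The difference is max(|%A - %T|, |%G - %C|); warn and fail are assumed cutoffs.
    """
    if not reads:
        return []
    length = max(len(r) for r in reads)
    results = []
    for pos in range(length):
        # Count bases at this position across all reads long enough to cover it.
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts[b] for b in "ACGT")  # N and other symbols are ignored
        if total == 0:
            continue
        pct = {b: 100.0 * counts[b] / total for b in "ACGT"}
        diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
        label = "fail" if diff > fail else "warn" if diff > warn else "pass"
        results.append((pos + 1, round(diff, 1), label))
    return results


# Toy reads: biased first base, balanced composition afterwards.
reads = ["GACGT", "GCGTA", "GGTAC", "GTACG"]
for position, diff, label in per_base_content_flags(reads):
    print(position, diff, label)
```

With these toy reads only the first position is flagged, which mirrors the pattern in the question: a composition bias confined to the start of the reads can be enough to turn the whole module red even when everything after it looks fine.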