Question: Questions regarding preprocessing of raw data and usage of FastQC
4.7 years ago by
Korea, Republic Of
mangfu100 (720) wrote:

Hi all.

I downloaded some paired-end sequencing files in FASTQ format.

I would like to make sure that there are no hidden problems that might be more difficult to detect at a later stage, so I ran FastQC to check the sequence quality before analyzing the files.

Unfortunately, some samples fail several FastQC modules. The ones I encounter most often are per base sequence content, sequence duplication levels, and adapter content.

I heard that there are tools that remove bad or biased reads, so I searched and found that "Trimmomatic" seemed to be what I was looking for. However, after reading its manual in detail, I don't think it fits my case, because my problems are mainly with sequence duplication levels, per base sequence content, and k-mer content, not with base quality or related quality measures.

Therefore I need to find another tool that suits my case, but I haven't found one.

Can you suggest or recommend any preprocessing tools, especially for resolving sequence duplication levels or per base sequence content?


sequencing sequence next-gen • 2.1k views

Regarding the duplications, I guess that you are getting too many overrepresented sequences? If you are interested in removing them, you can extract these sequences from the FastQC report and then filter them with Trimmomatic. But I'm not sure you want to do this; normally I only remove adapter sequences, not all overrepresented sequences (polyA...). Regarding per base sequence content, it is normal not to have a perfect distribution. If you're working with RNA-seq data, it's common to have a bias at the beginning of the read; you can remove the first 10 bp of the reads with Trimmomatic to fix it.
I usually don't care too much about these parameters; adapter content and per base quality are the most critical ones.
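
As a rough sketch of the "remove the first 10 bp" suggestion, the Python helper below only builds a Trimmomatic paired-end command line; it does not run anything. `HEADCROP` is the Trimmomatic step that cuts a fixed number of bases from the start of each read. The jar path, the input file names, and the output naming scheme are placeholders assumed for illustration.

```python
import shlex
from typing import List


def trimmomatic_pe_command(jar: str, r1: str, r2: str,
                           out_prefix: str, headcrop: int = 10) -> List[str]:
    """Build (but do not run) a Trimmomatic paired-end command line.

    The output file names derived from out_prefix are an assumed
    convention, not something Trimmomatic requires.
    """
    if headcrop < 1:
        raise ValueError("headcrop must be a positive number of bases")
    # PE mode takes four outputs: paired/unpaired for each mate.
    outputs = [
        f"{out_prefix}_R1_paired.fq.gz",
        f"{out_prefix}_R1_unpaired.fq.gz",
        f"{out_prefix}_R2_paired.fq.gz",
        f"{out_prefix}_R2_unpaired.fq.gz",
    ]
    # HEADCROP:<n> removes the first n bases of every read.
    return ["java", "-jar", jar, "PE", r1, r2, *outputs,
            f"HEADCROP:{headcrop}"]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of executing it.
    cmd = trimmomatic_pe_command("trimmomatic.jar",
                                 "sample_1.fq.gz", "sample_2.fq.gz", "sample")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` once the paths point at real files.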

modified 4.7 years ago • written 4.7 years ago by iraun (3.7k)

Thank you for your comments.

I am using whole-exome sequencing for my study and I have some questions for you about your reply.

Judging by the sequence duplication levels, I am getting too many overrepresented sequences. But this module only gives a distribution, so I have no way to get the sequences themselves.

Anyway, as you mentioned, the most critical modules are adapter content and per base quality. In my case I only have problems with adapter content, so can I resolve the adapter content with Trimmomatic?

I have not actually used Trimmomatic yet, so I am not sure whether it solves my problem.

written 4.7 years ago by mangfu100 (720)

You can find the sequences in the "Overrepresented sequences" table. It shows the sequence, how many times it appears, the percentage, and the possible source. In this last field, you can see which adapters were used in your library preparation. If the overrepresented sequence is an adapter, most probably you'll have a label indicating it (TruSeq, Illumina...). So, yes, you can use Trimmomatic to remove them. When you download Trimmomatic, there is a file called adapters.fa (or something similar), which is a FASTA file containing most of the adapters used in sequencing (most probably your adapters are included here). You can give this file to Trimmomatic using the ILLUMINACLIP argument, and it will look for each of the adapters in that file in your fq files and remove them. Another possibility is to create your own adapters.fa file. You can extract your specific adapter sequences from the FastQC overrepresented sequences, create a FASTA file with them, and give it to Trimmomatic.
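
The "create your own adapters file" route could look roughly like the sketch below. It assumes the `fastqc_data.txt` text report that FastQC writes inside its output archive, in which each module starts with a `>>Module name` line, ends with `>>END_MODULE`, and the overrepresented sequences rows are tab-separated (sequence, count, percentage, possible source). The function names and the FASTA record names are made up for illustration.

```python
from typing import Iterable, List, Tuple


def adapter_hits(fastqc_data: str) -> List[Tuple[str, str]]:
    """Return (sequence, possible_source) rows from the
    'Overrepresented sequences' module, skipping 'No Hit' rows."""
    hits: List[Tuple[str, str]] = []
    in_module = False
    for line in fastqc_data.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break
        # Skip other modules, the '#Sequence ...' header and blank lines.
        if not in_module or line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        sequence, source = fields[0], fields[3]
        if source != "No Hit":
            hits.append((sequence, source))
    return hits


def to_fasta(hits: Iterable[Tuple[str, str]]) -> str:
    """Format the hits as FASTA text; record names are arbitrary."""
    records = []
    for index, (sequence, source) in enumerate(hits, start=1):
        records.append(f">overrepresented_{index} {source}\n{sequence}\n")
    return "".join(records)
```

The resulting text can be written to a file and passed to Trimmomatic as `ILLUMINACLIP:<file>:2:30:10`, where the three numbers (seed mismatches, palindrome clip threshold, simple clip threshold) are the values used in the example in the Trimmomatic manual rather than settings tuned for this data.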

written 4.7 years ago by iraun (3.7k)


I am puzzled, because FastQC flags adapter content as a problem while the overrepresented sequences module looks fine. So I didn't get any adapter-related sequences :(


modified 4.7 years ago • written 4.7 years ago by mangfu100 (720)
4.7 years ago by
Zaag (770) wrote:

Have a look at PrinSEQ:

It gives QC plots and has options to trim or filter reads based on duplication level or repetitive sequence (and of course quality and all that).
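
To make "filter reads based on duplication" concrete, here is a minimal Python sketch of exact-duplicate removal for paired-end reads. It is not how PrinSEQ is implemented; it simply keeps the first read pair for each distinct (mate 1 sequence, mate 2 sequence) combination, and it holds everything in memory, so it is only suitable for small files. All names are illustrative.

```python
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator line, qualities.
Record = Tuple[str, str, str, str]


def parse_fastq(text: str) -> List[Record]:
    """Split uncompressed FASTQ text into 4-line records."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of 4 lines")
    return [(lines[i], lines[i + 1], lines[i + 2], lines[i + 3])
            for i in range(0, len(lines), 4)]


def drop_exact_duplicate_pairs(
    pairs: Iterable[Tuple[Record, Record]],
) -> Iterator[Tuple[Record, Record]]:
    """Yield only the first pair seen for each (mate 1, mate 2) sequence."""
    seen = set()
    for mate1, mate2 in pairs:
        key = (mate1[1], mate2[1])  # compare sequences, ignore names/qualities
        if key in seen:
            continue
        seen.add(key)
        yield mate1, mate2
```

Calling `drop_exact_duplicate_pairs(zip(parse_fastq(r1_text), parse_fastq(r2_text)))` assumes both files list the mates in the same order, which is the usual layout for paired-end FASTQ files.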
