Question

Questions regarding preprocessing of raw data and usage of FastQC

1

8.9 years ago

mangfu100 ▴ 800

Hi all

I downloaded some paired-end sequencing files in FASTQ format.

I would like to make sure that there are no hidden problems that might be more difficult to detect at a later stage, so I ran FastQC to check the sequence quality before analyzing the files.

Unfortunately, some samples show problems. In particular, per base sequence content, sequence duplication levels, and adapter content are the modules that most often fail.

I heard that there are tools that remove bad or biased reads, so I searched and found Trimmomatic, which seemed to be what I was looking for. However, after reading its manual in detail, I don't think it fits my case, because my problems are mainly with sequence duplication levels, per base sequence content, and k-mer content, not with base quality or related quality metrics.

Therefore I need another tool that suits my case, but I couldn't find one.

Can you suggest or recommend any preprocessing tools for me?

I am especially interested in resolving the sequence duplication levels and per base sequence content issues.

sequence next-gen-sequencing • 3.4k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by mangfu100 ▴ 800

2

Regarding the duplications, I guess that you are getting too many overrepresented sequences? If you are interested in removing them, you can extract these sequences from the FastQC report and then filter them with Trimmomatic. But I'm not sure you want to do this; normally I only remove adapter sequences, not all overrepresented sequences (polyA...). Regarding per base sequence content, it is normal not to have a perfect distribution. If you're working with RNA-seq data, it's common to have a bias at the beginning of the read; you can remove the first 10 bp of the reads with Trimmomatic to fix it.
I usually don't care too much about these parameters; adapter content and per base quality are the most critical ones.
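
To make the "remove the first 10 bp" suggestion concrete, here is a minimal Python sketch of what such a head crop does to each read. In Trimmomatic itself this corresponds to the `HEADCROP:10` step; the code below is only an illustration, not Trimmomatic's implementation, and the function names and the example read are made up for the purpose.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); a hypothetical record type for this sketch
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from well-formed, 4-line-per-read FASTQ text."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it)
        next(it)  # the '+' separator line is not needed here
        qual = next(it)
        yield header, seq, qual


def head_crop(record: FastqRecord, n: int = 10) -> FastqRecord:
    """Drop the first n bases together with their quality scores."""
    header, seq, qual = record
    return header, seq[n:], qual[n:]


# Made-up example read, 16 bases long
example = [
    "@read1",
    "ACGTACGTACGTACGT",
    "+",
    "IIIIIIIIIIFFFFFF",
]
cropped = [head_crop(r, 10) for r in parse_fastq(example)]
print(cropped)
```

The sequence and the quality string have to be cropped together, otherwise the record is no longer valid FASTQ. For paired-end data the same crop would be applied to both mates.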

ADD REPLY • link 8.9 years ago by iraun 6.2k

0

Thank you for your comments.

I am using whole-exome sequencing for my study, and I have some questions about your reply.

Judging by the sequence duplication levels, I am getting too many overrepresented sequences, but this module only gives a distribution, so I have no way to get the actual sequences.

Anyway, as you mentioned, the most critical modules are adapter content and per base quality, and in my case I only have problems with adapter content. So can I resolve the adapter content problem by using Trimmomatic?

Actually, I am not currently using Trimmomatic, so I am not sure whether this tool would solve my problem.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by mangfu100 ▴ 800

0

You can find the sequences in the "Overrepresented sequences" section of the FastQC report. It lists each sequence, how many times it appears, its percentage, and its possible source. In this last field, you can see which adapters were used in your library preparation. If the overrepresented sequence is an adapter, you will most probably see a label indicating it (TruSeq, Illumina...). So, yes, you can use Trimmomatic to remove them. When you download Trimmomatic, there is a file called adapters.fa (or something similar), which is a FASTA file containing most of the adapters used in sequencing (most probably your adapters are included there). You can give this file to Trimmomatic using the ILLUMINACLIP argument, and it will look for each of the adapters from that file in your FASTQ files and remove them. Another possibility is to create your own adapters.fa file: extract your specific adapter sequences from the FastQC overrepresented sequences, create a FASTA file with them, and give it to Trimmomatic.
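
As a rough sketch of the "create your own adapters file" route, the Python below pulls the overrepresented sequences out of the text of a `fastqc_data.txt` file (found inside the FastQC output archive) and writes the adapter-like ones as FASTA. It assumes the usual tab-separated module layout with the columns Sequence, Count, Percentage, and Possible Source; the function names, the keyword filter, and the sample rows are made up for illustration.

```python
from typing import Iterable, List, Tuple


def overrepresented_sequences(fastqc_data: str) -> List[Tuple[str, str]]:
    """Return (sequence, possible_source) rows from the
    'Overrepresented sequences' module of fastqc_data.txt text."""
    rows = []
    in_module = False
    for line in fastqc_data.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            fields = line.split("\t")
            # Assumed columns: Sequence, Count, Percentage, Possible Source
            source = fields[3] if len(fields) > 3 else ""
            rows.append((fields[0], source))
    return rows


def to_fasta(rows: Iterable[Tuple[str, str]],
             keywords: Tuple[str, ...] = ("adapter", "primer")) -> str:
    """Keep rows whose source mentions a keyword and format them as FASTA."""
    out = []
    for i, (seq, source) in enumerate(rows, start=1):
        if any(k in source.lower() for k in keywords):
            out.append(f">overrepresented_{i} {source}\n{seq}\n")
    return "".join(out)


# Made-up module content, for illustration only
sample = (
    ">>Overrepresented sequences\twarn\n"
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC\t1200\t0.6\t"
    "TruSeq Adapter, Index 1 (100% over 34bp)\n"
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\t800\t0.4\tNo Hit\n"
    ">>END_MODULE\n"
)
custom_fasta = to_fasta(overrepresented_sequences(sample))
print(custom_fasta)
```

The resulting text could be saved to a file (say, a hypothetical `custom_adapters.fa`) and passed to Trimmomatic as `ILLUMINACLIP:custom_adapters.fa:2:30:10`. The three numbers are the seed mismatches, palindrome clip threshold, and simple clip threshold; the values shown are the ones from the example in the Trimmomatic manual, not values tuned for any particular data set.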

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by iraun 6.2k

0

Thanks.

I am puzzled, because the adapter content module reports a problem while the overrepresented sequences module is fine. So I didn't get any adapter-related sequences :(

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by mangfu100 ▴ 800

Ram · Answer 1 · 2015-05-29

2

8.9 years ago

Zaag ▴ 860

Have a look at PrinSEQ: http://prinseq.sourceforge.net/manual.html

It gives QC plots and has options to trim or filter reads based on duplication level or repetitive sequence (and, of course, quality and all that).
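
To show what filtering on duplication level means at its simplest, here is a small Python sketch that counts exact duplicate sequences and keeps the first copy of each. This is not PrinSEQ's implementation, only an illustration of exact-match deduplication; the function names and the toy reads are made up, and real paired-end data would need both mates considered together.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplication_summary(sequences: Iterable[str]) -> Tuple[int, int, float]:
    """Return (total reads, distinct sequences, percent exact duplicates)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    unique = len(counts)
    pct_dup = 100.0 * (total - unique) / total if total else 0.0
    return total, unique, pct_dup


def drop_exact_duplicates(sequences: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each sequence, preserving input order."""
    seen = set()
    kept = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept


# Toy reads, for illustration only
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
summary = duplication_summary(reads)
deduped = drop_exact_duplicates(reads)
print(summary, deduped)
```

Holding every distinct sequence in memory is fine for a sketch but may not scale to a full sequencing run, which is one reason to rely on a dedicated tool for this step.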

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by Zaag ▴ 860