Interpretation of FASTQC results - do I need to trim my sequence?
5.3 years ago

I am quality checking a gene sequence using FASTQC.

It has given me warnings about sequence duplication (44%), per base sequence content and Kmer content.

Should I trim this sequence using Trim Galore or Cutadapt to rectify these problems?

I tried this using a Q30 quality cutoff and removing the adapter sequence AGATCGGAAGAGC. However, this returned my paired reads with even more warnings, plus fails on per base sequence content and GC content.

So I am not sure how to proceed: whilst fixing the warnings on the original sequence, I have created new problems by trimming it.

Any help/advice would be appreciated :)

Assembly sequencing fastqc trim illumina • 5.2k views
5.3 years ago
GenoMax 104k

Take a look at this set of blog posts by Dr. Simon Andrews (author of FastQC) and see if they answer some of your questions. This post may be directly relevant in this case.
Please don't trim data based on Q-scores unless you have some really bad data (Q10 or less). Otherwise you would be throwing away good data.
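
To put those Q values in perspective, here is a minimal Python sketch that decodes a FASTQ quality string and converts Phred scores into error probabilities. It assumes Phred+33 encoding (the usual one for current Illumina data); the function names are just for illustration and are not part of any tool mentioned here.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into a list of integer Phred scores.

    Assumes Phred+33 encoding unless a different offset is given.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def fraction_below(quality_line, threshold=10, offset=33):
    """Fraction of bases in one read whose Phred score is below `threshold`."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        return 0.0
    return sum(1 for q in scores if q < threshold) / len(scores)
```

By this definition Q10 corresponds to roughly a 1 in 10 chance of a wrong base call and Q30 to roughly 1 in 1000, which is why a Q30 cutoff discards a lot of bases that are still quite reliable.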

5.3 years ago
chen ★ 2.1k

Duplication of 44% is not very high if you were doing some kind of deep sequencing (>200x).
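
If you want a rough independent number, the sketch below computes the fraction of reads that are exact copies of another read. This is a simplification and not FastQC's own method (FastQC estimates duplication from a subset of the reads, as far as I know, so the figures may not match exactly). The function names are made up for this example.

```python
from collections import Counter
from itertools import islice


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    # The sequence is the 2nd line of every record: indices 1, 5, 9, ...
    for line in islice(fastq_lines, 1, None, 4):
        yield line.strip()


def duplication_fraction(sequences):
    """Fraction of reads that are exact duplicates of another read.

    Computed as 1 - (distinct sequences / total sequences).
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total
```

Usage would be something like `duplication_fraction(read_sequences(open("reads.fastq")))`, where `reads.fastq` is a placeholder file name. Note that this holds every distinct read in memory, so it is only practical on a subsample of a large run.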

Trimming the head/tail of reads is needed when you want to reduce false-positive mutation calls (especially low-frequency somatic mutations). Try https://github.com/OpenGene/after, which can do automatic filtering, trimming and error removal.

5.3 years ago

Thank you very much! Is it a problem that my sequence duplication is very high (44%)?


Does the following from your original post mean that you are looking at just one gene (amplicon sequencing)?

I am quality checking a gene sequence using FASTQC


If that is the case then you would expect a lot of duplication. Have you scanned your data with a trimming program to ensure that there is no adapter contamination?
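
As a quick sanity check before running a real trimmer, something like the following Python sketch counts how many reads contain the adapter you mentioned (AGATCGGAAGAGC). It only looks for an exact, full-length match, so treat the result as a lower bound: dedicated trimming programs also catch partial adapters at the 3' end and tolerate mismatches. The function name is hypothetical.

```python
def adapter_fraction(sequences, adapter="AGATCGGAAGAGC"):
    """Fraction of reads containing an exact copy of `adapter`.

    `sequences` is any iterable of read sequences (strings).
    Matching is case-insensitive on the read side.
    """
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

If this comes back close to zero, adapter contamination is unlikely to explain the warnings.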


Sorry, I meant to say a genome sequence.

TGCTG is a sequence identified in the 'Kmer content' module, with a count of 752200.
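
For reference, a raw k-mer count like that can be reproduced with a few lines of Python. This is only a sketch of plain counting with k=5 (the length of TGCTG); it is not how FastQC decides that a k-mer is flagged, since as far as I understand that module looks at enrichment by position in the read rather than the total count. The function name is made up.

```python
from collections import Counter


def count_kmers(sequences, k=5):
    """Count every overlapping k-mer across an iterable of read sequences."""
    counts = Counter()
    for seq in sequences:
        # Reads shorter than k produce an empty range and are skipped.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

A count on its own says little; it would need to be compared with how often a 5-mer is expected given the number of reads and the base composition.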
