Question

Interpretation of FASTQC results - do I need to trim my sequence?

9.3 years ago

sallyanndunn18 ▴ 10

I am quality checking a gene sequence using FASTQC.

It has given me warnings about sequence duplication (44%), per base sequence content and Kmer content.

Should I trim this sequence using Trim Galore or Cutadapt to rectify these problems?

I tried this using a quality cutoff of 30 and removing the adapter sequence AGATCGGAAGAGC. However, this returned my paired sequences with even more warnings, plus fails on per base sequence content and GC content.

So I am not sure how to proceed: while fixing the warnings on the original sequence, I have created new problems by trimming it.

Any help/advice would be appreciated :)

Assembly sequencing fastqc trim illumina • 7.2k views

updated 9.3 years ago by chen ★ 2.5k • written 9.3 years ago by sallyanndunn18 ▴ 10

score 4 · Answer 1 · 2016-03-28

Take a look at this set of blog posts by Dr. Simon Andrews (author of FastQC) and see if they answer some of your questions. This post may be directly relevant in this case.
Please don't trim data based on Q-scores unless you have some really bad data (Q10 or less). Otherwise you would be throwing away good data.
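One way to judge whether the data is bad enough to justify quality trimming is to measure what fraction of base calls fall below Q10. This is only a minimal sketch, assuming Phred+33 quality encoding and an uncompressed four-line FASTQ; the function names and the file name `reads.fastq` are hypothetical.

```python
from typing import Iterable, Iterator


def fraction_below_q(quality_lines: Iterable[str], threshold: int = 10, offset: int = 33) -> float:
    """Return the fraction of base calls whose Phred quality is below `threshold`.

    Assumption: Phred+33 encoding (offset=33). Older Phred+64 data needs offset=64.
    """
    low = 0
    total = 0
    for qual in quality_lines:
        for ch in qual.strip():
            total += 1
            if ord(ch) - offset < threshold:
                low += 1
    return low / total if total else 0.0


def quality_lines_from_fastq(path: str) -> Iterator[str]:
    """Yield the quality string (4th line of each record) from a plain 4-line FASTQ."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:
                yield line.rstrip("\n")
```

Calling `fraction_below_q(quality_lines_from_fastq("reads.fastq"))` gives a single number to look at before deciding on any quality cutoff. If it is close to zero, quality trimming is unlikely to gain much.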

score 1 · Answer 2 · 2016-03-28

Duplication of 44% is not very high if you were doing some kind of deep sequencing (>200x).
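To put the 44% in context, it can help to estimate the sequencing depth and compare it with a simple duplicate count. The sketch below is an illustration only: it counts exact full-length duplicates, which is a simplification and not FastQC's own estimate, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def duplication_fraction(sequences: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of a read already seen.

    Simplification: exact string matching on whole reads, no sampling or truncation.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total
```

For example, 1,000,000 reads of 100 bp on a 500 kb genome gives `mean_coverage(1_000_000, 100, 500_000)`, i.e. 200x, a depth at which many identical reads are expected by chance.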

Trimming the heads and tails of reads is needed when you want to reduce false-positive mutation calls (especially low-frequency somatic mutations). Try https://github.com/OpenGene/after, which can do automatic filtering, trimming and error removing.

score 0 · Answer 3 · 2016-03-28

9.3 years ago

sallyanndunn18 ▴ 10

Thank you very much! Is it a problem that my sequence duplication is very high (44%)?

Does the following from your original post mean that you are looking at just one gene (amplicon sequencing)?

I am quality checking a gene sequence using FASTQC

If that is the case then you would expect a lot of duplication. Have you scanned your data with a trimming program to ensure that there is no adapter contamination?
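A quick first check for adapter contamination is to count how many reads contain the adapter sequence mentioned in the question (AGATCGGAAGAGC). This is a rough sketch with a hypothetical function name. It only finds exact, complete matches, so it will miss partial adapters at the 3' end of reads and adapters with sequencing errors, which dedicated trimming programs are designed to handle.

```python
from typing import Iterable


def adapter_hit_fraction(sequences: Iterable[str], adapter: str = "AGATCGGAAGAGC") -> float:
    """Fraction of reads containing an exact copy of `adapter` anywhere in the read."""
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

A fraction well above zero suggests adapter read-through is present and is worth removing with a proper trimmer.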

9.3 years ago by GenoMax 152k

Sorry I meant to say a genome sequence.

TGCTG is a sequence identified in the 'Kmer content' module, with a count of 752200.
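To see whether a k-mer like this piles up at particular positions in the reads rather than being spread evenly, its start positions can be tallied. This is a small sketch with a hypothetical function name, not how FastQC computes its Kmer module.

```python
from collections import Counter
from typing import Iterable


def kmer_start_positions(sequences: Iterable[str], kmer: str = "TGCTG") -> Counter:
    """Count, per 0-based read position, how often `kmer` starts there.

    Overlapping occurrences are counted, since the search resumes one base
    after each hit.
    """
    positions: Counter = Counter()
    for seq in sequences:
        start = seq.find(kmer)
        while start != -1:
            positions[start] += 1
            start = seq.find(kmer, start + 1)
    return positions
```

A sharp peak at one position points towards a technical cause such as an adapter or primer, while a flat distribution is more consistent with genuine genome content.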

9.3 years ago by sallyanndunn18 ▴ 10