Question

Are FastQC Over-Represented Sequences Adapters?

2

10.5 years ago

sckinta ▴ 730

I evaluated the quality of my RNA-seq data with FastQC and found that the sequence quality was not good enough for downstream analysis. However, the quality report lists no over-represented sequences. Since the data had already been processed by others, I was told to remove adapter sequences and low-quality sequences first and then repeat the quality evaluation. I was wondering whether adapter clipping will improve the FastQC report when FastQC detected no over-represented sequences. In other words, are the over-represented sequences detected by FastQC adapters?

Here is my FastQC command line:

fastqc -o ST_read1_fastqc --contaminants TruSeq2-PE.txt --noextract ST_read1.fastq

fastqc fastx • 27k views

updated 10.5 years ago by Rohit ★ 1.5k • written 10.5 years ago by sckinta ▴ 730

1

There can be a bunch of over-represented sequences other than adapters. But if you didn't find any over-represented sequences, this means that the adapters have already been trimmed off.

10.5 years ago by Ashutosh Pandey 12k

0

So adapters are a kind of over-represented sequence, and FastQC should be able to detect them if they are present in the sequence data? The interesting thing is that I ran fastx_clipper with each adapter in TruSeq2-PE.txt, and plenty of sequences were discarded because they were too short after trimming. The inconsistency between FastQC and the FASTX-Toolkit leaves me very confused about the concept of over-represented sequences.

10.5 years ago by sckinta ▴ 730

score 8 · Answer 1 · 2013-11-08

The over-represented sequences in RNA-seq data might be adapters, poly-A tails, or any contamination that gets amplified before sequencing. I think that in your case, if the over-represented ones were not the adapters, then they just might be contaminants.

If I understand correctly, your question about over-represented sequences should be answered by this page: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Overrepresented%20Sequences.html

The documentation should also answer other questions about the FastQC output. There is a description of each module: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

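Always keep in mind that, for the over-represented sequences module, FastQC only tracks sequences that appear within the first 200,000 reads and then follows those through the rest of the file, so the result is an estimate based on the start of the file rather than a count of every distinct sequence.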
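
To make that sampling behaviour concrete, here is a minimal Python sketch of the idea as the FastQC documentation describes it; it is not FastQC's actual code. The function name `overrepresented`, the `track_limit` default (taken from the 200,000 figure above), and the 0.1% reporting threshold are assumptions for illustration, as are the example sequences. It also shows why reads that all contain an adapter need not produce any over-represented sequence: the adapter starts at a different position in each read, so no whole read repeats.

```python
import random
from collections import Counter


def overrepresented(reads, track_limit=200_000, min_fraction=0.001):
    """Return (sequence, count, fraction) for whole reads above min_fraction.

    Only sequences first seen within the first `track_limit` reads are
    tracked; later reads only increment sequences that are already tracked.
    """
    counts = Counter()
    total = 0
    for seq in reads:
        total += 1
        if seq in counts:
            counts[seq] += 1
        elif total <= track_limit:
            counts[seq] = 1
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]


# Illustrative data: every read is a random insert followed by the adapter.
rng = random.Random(0)
ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix, not read from any file
trimmable = [
    "".join(rng.choice("ACGT") for _ in range(rng.randint(20, 30))) + ADAPTER
    for _ in range(2000)
]
# Adapter dimers, by contrast, are identical from the first base onwards.
dimer = ADAPTER + "ACACGTCTGAACTCCAGTCAC"
with_dimers = trimmable + [dimer] * 50

print(overrepresented(trimmable))        # adapter in every read, no whole read repeats
print(overrepresented(with_dimers)[:1])  # the identical dimer reads stand out
```

In this toy data a clipper would trim every read in `trimmable`, while a whole-read count sees nothing unusual, which is one way the two tools can appear to disagree.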

I've heard of instances of the FASTX-Toolkit trimming more than just the adapter. Cutadapt, Trim Galore, or Trimmomatic can be tried too, but the best choice always depends on your data. It would help a lot if you knew the adapter sequences in advance.

Check the Biostars threads below on adapter trimming:

Fastx clipper not accurate in adapter clipping for 100nt reads

How to best deal with adapter contamination (Illumina)?

trimming adapters

Hope this helps :)

score 1 · Answer 2 · 2013-11-08

1

10.5 years ago

Istvan Albert 100k

Well, remove the adapters and check that task off the list.

If indeed only a tiny amount of data is affected, make a note of it and refer to that note later.

BTW, FastQC will only report enrichment if it exceeds a certain limit.
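
One way to judge whether only a tiny amount of data is affected is to count the reads that contain the start of the adapter anywhere in the read. The sketch below is a rough, exact-match estimate under stated assumptions: the function name `adapter_fraction`, the 12-base seed length, and the example reads are all hypothetical, and a real trimmer also handles mismatches and partial adapters at the read end.

```python
import io


def adapter_fraction(fastq_handle, adapter, seed_len=12):
    """Count reads containing the first `seed_len` bases of `adapter`.

    Expects plain four-line FASTQ records (header, sequence, '+', qualities).
    Returns (reads_with_seed, total_reads).
    """
    seed = adapter[:seed_len].upper()
    hits = 0
    total = 0
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 1:  # the sequence is the second line of each record
            continue
        total += 1
        if seed in line.strip().upper():
            hits += 1
    return hits, total


# Tiny in-memory example; with real data, pass open("ST_read1.fastq") instead.
records = [
    ("r1", "ACGTACGTAGATCGGAAGAGCAC"),  # adapter seed starts at base 9
    ("r2", "TTGACCGTAGGCTAACGGTACCA"),  # no adapter seed
    ("r3", "GGAGATCGGAAGAGCACACGTCT"),  # adapter seed starts at base 3
]
example = io.StringIO(
    "".join(f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n" for name, seq in records)
)
hits, total = adapter_fraction(example, "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC")
print(f"{hits} of {total} reads contain the adapter seed")
```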

10.5 years ago by Istvan Albert 100k

1

I ran fastx_clipper with each adapter in TruSeq2-PE.txt yesterday. A lot of sequences were discarded because they were too short after trimming. The inconsistency between FastQC and the FASTX-Toolkit leaves me very confused about the concept of over-represented sequences.

10.5 years ago by sckinta ▴ 730

2

Oh wait, over-represented sequences may have nothing to do with the adapters. Read the documentation on what actually happens:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

The over-represented sequence count is performed only on the first 200K sequences; moreover, the sequences need to be present in their entirety, and that may not be true for adapters.

The over-represented k-mers module is a better indicator of contamination, but the real test is what happens after trimming.

10.5 years ago by Istvan Albert 100k

1

Over-represented sequences can be adapters, but generally only if you have adapter dimers (which are pretty common). If the adapters are not at the read ends, they can still show up in the k-mer content.

Also, if there is no exact match of at least 20 bases, the most similar sequences will be checked. So, as Istvan said, an over-represented sequence is less likely to be an adapter and more likely to be contamination.
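
As far as the FastQC documentation describes it, a hit against the contaminants list must be at least 20 bp long and contain no more than 1 mismatch. The following is a minimal Python sketch of that rule, not FastQC's implementation; the function names and example sequences are hypothetical, and `N` is used only as a stand-in for a base that never matches the contaminant.

```python
def longest_match(query, contaminant, max_mismatch=1):
    """Length of the longest ungapped alignment with at most max_mismatch mismatches."""
    best = 0
    # Slide the contaminant along the query; query[i] aligns with contaminant[i - offset].
    for offset in range(-(len(contaminant) - 1), len(query)):
        start = max(0, offset)
        end = min(len(query), offset + len(contaminant))
        left = start
        mismatches = 0
        for right in range(start, end):
            if query[right] != contaminant[right - offset]:
                mismatches += 1
            # Shrink the window from the left until it is within the mismatch budget.
            while mismatches > max_mismatch:
                if query[left] != contaminant[left - offset]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


def is_contaminant_hit(query, contaminant, min_len=20, max_mismatch=1):
    """Apply the assumed rule: at least min_len aligned bases, at most max_mismatch mismatches."""
    return longest_match(query, contaminant, max_mismatch) >= min_len


ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # illustrative contaminant entry

full = "N" * 10 + ADAPTER[:25]                 # 25 adapter bases, exact
one_off = ADAPTER[:12] + "N" + ADAPTER[13:25]  # 25 bases with one mismatch
short = "N" * 24 + ADAPTER[:12]                # only 12 adapter bases at the read end

for name, read in [("full", full), ("one_off", one_off), ("short", short)]:
    print(name, longest_match(read, ADAPTER), is_contaminant_hit(read, ADAPTER))
```

The `short` case is the one relevant to this thread: a read that ends in only a few adapter bases falls below the length cut-off, even though a clipper would still trim it.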

10.5 years ago by Rohit ★ 1.5k