Question: Are FastQC Over-Represented Sequences Adapters?
2
5.9 years ago by
sckinta550
United States
sckinta550 wrote:

I evaluated the quality of my RNA-seq data with FastQC and found that the sequence quality was not good enough for downstream analysis. However, the quality report lists no over-represented sequences. Because the data had already been processed by others, I was told to remove adapter sequences and low-quality sequences first and then repeat the quality evaluation. I wonder whether adapter clipping will improve the FastQC report when FastQC detects no over-represented sequences. In other words, are the over-represented sequences detected by FastQC the adapters?

Here is my FastQC command line:

fastqc -o ST_read1_fastqc --contaminants  TruSeq2-PE.txt -noextract  ST_read1.fastq
fastx fastqc • 18k views
modified 5.9 years ago by Rohit1.4k • written 5.9 years ago by sckinta550

There can be a bunch of over-represented sequences other than adapters. But if you didn't find any over-represented sequences, this means that the adapters have already been trimmed off.

written 5.9 years ago by Ashutosh Pandey11k

So adapters are a kind of over-represented sequence and should be detected by FastQC if they are present in the sequence data? The interesting thing is that I ran fastx_clipper with each adapter in TruSeq2-PE.txt, and plenty of sequences were discarded because they were too short after trimming. This inconsistency between FastQC and the FASTX-Toolkit leaves me very confused about the concept of over-represented sequences.

written 5.9 years ago by sckinta550
7
5.9 years ago by
Rohit1.4k
California
Rohit1.4k wrote:

The over-represented sequences in RNA-seq data might be adapters, poly-A, or any contamination that was amplified before sequencing. In your case, if the over-represented sequences were not the adapters, they might simply be contaminants.

If I understand correctly, your question about over-represented sequences might be answered by this link: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Overrepresented%20Sequences.html

The documentation will also answer other questions about the FastQC output; it has a description of each module: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

Always keep in mind that FastQC bases this check on the first 200,000 sequences and then extrapolates to the rest of your reads.

I've heard of cases where the FASTX-Toolkit trimmed more than just the adapter. You can also try Cutadapt, Trim Galore, or Trimmomatic, but the best choice always depends on your data. It helps a lot if you know the adapter sequences in advance.
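To make the "too short after trimming" outcome concrete, here is a minimal sketch of what a 3' adapter clipper does, assuming exact matching only. The function names, the `min_overlap` and `min_length` parameters, and the example reads are made up for illustration, and the adapter is the commonly cited Illumina TruSeq prefix used as a stand-in. It is not how fastx_clipper, Cutadapt, or Trimmomatic are actually implemented; real trimmers also tolerate mismatches and use quality scores.

```python
# Commonly cited Illumina TruSeq adapter prefix, used here only as an example.
ADAPTER = "AGATCGGAAGAGC"


def clip_adapter(read, adapter, min_overlap=5):
    """Remove the adapter and everything after it from a read (exact matches only)."""
    # A full adapter copy anywhere in the read: cut at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise look for a partial adapter hanging off the 3' end,
    # trying the longest possible overlap first.
    max_overlap = min(len(adapter) - 1, len(read))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


def clip_reads(reads, adapter, min_length=20):
    """Clip every read, then split the results into kept and discarded (too short)."""
    kept, discarded = [], []
    for read in reads:
        clipped = clip_adapter(read, adapter)
        if len(clipped) >= min_length:
            kept.append(clipped)
        else:
            discarded.append(clipped)
    return kept, discarded


# Hypothetical reads: a long insert, a short insert with read-through, no adapter.
example_reads = [
    "ACGT" * 6 + ADAPTER,
    "ACGTACGT" + ADAPTER + "ACACGTCTGAACTCC",
    "TTGCATTGCATTGCATTGCATTGCATTGCATTGCAA",
]

kept, discarded = clip_reads(example_reads, ADAPTER)
print("kept:", kept)
print("discarded:", discarded)
```

The second read has only 8 bases of insert before the adapter, so it falls below `min_length` after clipping. Many reads like this would explain a large "too short" count even when no single whole read is frequent enough to be reported as over-represented.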

Check the Biostars pages below on adapter trimming:

Fastx clipper not accurate in adapter clipping for 100nt reads

How to best deal with adapter contamination (Illumina)?

trimming adapters

Hope this helps :)

written 5.9 years ago by Rohit1.4k

The links are very useful. Thank you very much!

written 5.9 years ago by sckinta550
1
5.9 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Well, remove the adapters and check that task off the list.

If indeed only a very tiny amount of data is affected, make a note and refer to that note later.

BTW, FastQC will only report enrichment if it exceeds a certain threshold.

written 5.9 years ago by Istvan Albert ♦♦ 81k
1

I ran fastx_clipper with each adapter in TruSeq2-PE.txt yesterday. A lot of sequences were discarded because they were too short after trimming. This inconsistency between FastQC and the FASTX-Toolkit leaves me very confused about the concept of over-represented sequences.

written 5.9 years ago by sckinta550
2

Oh wait, over-represented sequences may have nothing to do with the adapters. Read the documentation on what actually happens:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

The over-represented sequence count is performed only on the first 200K sequences. Moreover, the sequences need to match in their entirety, and that may not be true for adapters.
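A rough sketch of that counting logic may help show why adapter read-through usually goes unreported. It assumes the 200K tracking window mentioned in this thread and the 0.1% reporting threshold described in the FastQC documentation; the function names and test data are hypothetical, and FastQC's real module does more than this (for example, it compares hits against a contaminant list).

```python
from collections import Counter

# Commonly cited Illumina TruSeq adapter prefix, used here only as an example.
ADAPTER = "AGATCGGAAGAGC"


def overrepresented(reads, track_first=200_000, threshold=0.001):
    """Return whole reads that make up more than `threshold` of all reads.

    Only sequences first seen within the first `track_first` reads are
    counted; anything that first appears later is ignored.
    """
    counts = Counter()
    total = 0
    for read in reads:
        total += 1
        if total <= track_first or read in counts:
            counts[read] += 1
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}


def unique_insert(i, length=20):
    """Encode an integer as a DNA string so that every insert is distinct."""
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[i % 4])
        i //= 4
    return "".join(out)


# Every read carries the adapter, but each whole read is different.
with_adapter = [unique_insert(i) + ADAPTER for i in range(2000)]
print(overrepresented(with_adapter))

# Adapter dimers are identical whole reads, so they do get reported.
dimer = ADAPTER * 2  # hypothetical stand-in for a real dimer sequence
print(overrepresented(with_adapter + [dimer] * 50))
```

In the first call every read contains the adapter, yet nothing is reported because no whole read repeats. In the second call the 50 identical dimer reads exceed the threshold and are reported.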

The over-represented k-mers module is a better indicator of contamination, but the real test is what happens after trimming.

written 5.9 years ago by Istvan Albert ♦♦ 81k
1

Over-represented sequences can be adapters, but generally only if you have adapter dimers (which are pretty common). If the adapters are not at the ends of the reads, they can instead show up in the k-mer content.

And if there are no exact 20-base matches either, the most similar ones will be checked. So, as Istvan said, the chances are lower for adapters and higher for contamination.
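Independently of FastQC, one quick check is to count how many reads contain the start of a known adapter and at which offset. The sketch below is only an illustration under simple assumptions: exact matching on a 12-base seed, a hypothetical function name, and made-up reads. It is not a replacement for the k-mer module or for a real trimmer.

```python
from collections import Counter

# Commonly cited Illumina TruSeq adapter prefix, used here only as an example.
ADAPTER = "AGATCGGAAGAGC"


def adapter_positions(reads, adapter, seed_length=12):
    """Count reads containing the first `seed_length` adapter bases, per start offset."""
    seed = adapter[:seed_length]
    positions = Counter()
    for read in reads:
        pos = read.find(seed)
        if pos != -1:
            positions[pos] += 1
    return positions


# Hypothetical reads: adapter at the 3' end, mid-read, absent, and at the start.
example_reads = [
    "ACGTACGTAC" + ADAPTER,
    "ACGTAC" + ADAPTER + "TTTT",
    "GGGGCCCCAAAATTTT",
    ADAPTER + "ACACGT",
]

positions = adapter_positions(example_reads, ADAPTER)
fraction = sum(positions.values()) / len(example_reads)
print("adapter start offsets:", dict(positions))
print("fraction of reads with adapter seed:", fraction)
```

Hits spread over many offsets point to read-through from short inserts, while hits concentrated at offset 0 look more like adapter dimers.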

modified 5.9 years ago • written 5.9 years ago by Rohit1.4k