FASTQC: overrepresented sequences
3.1 years ago
kuilin • 0

Hi all, I have run into a problem while trimming with cutadapt.

After trimming the adapter sequence and running FastQC, I got a warning for overrepresented sequences:

Sequence:
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Count: 55673

Percentage: 0.12298334311639213

I then ran a second round of cutadapt to remove the Gs, and FastQC now reports an overrepresented sequence that shows up empty, with a count of 88854 and a percentage of 0.19600626265311266.

Then I tried trimming the original reads again with Trim Galore and got essentially the same overrepresented-sequence warning as in the first cutadapt attempt, showing many Gs.

My questions are: what did I do wrong with cutadapt, and should I trust the results from Trim Galore?

RNA-Seq • 3.3k views

0.12298334311639213% => just ignore it, it is barely an issue.

3.1 years ago
GenoMax 141k

Poly-G means no signal on Illumina sequencers that use 2-color chemistry. That sequence is not real. If you check the corresponding Q scores, they will be poor.
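
One way to check the Q scores of the poly-G reads is a short script like the one below. It is only a sketch: it assumes an uncompressed FASTQ with four lines per record and Phred+33 encoding, and the function names, the 0.9 G-fraction threshold, and the toy records are illustrative choices, not anything taken from the original post.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (0.0 for an empty read)."""
    return mean(ord(c) - offset for c in qual) if qual else 0.0


def polyg_quality_report(handle, min_g_fraction=0.9):
    """Return the mean Phred score of every read that is mostly G."""
    scores = []
    for _, seq, qual in read_fastq(handle):
        if seq and seq.count("G") / len(seq) >= min_g_fraction:
            scores.append(mean_phred(qual))
    return scores


# Toy two-record input (made-up data for illustration only).
demo = io.StringIO(
    "@r1\nGGGGGGGGGG\n+\n##########\n"
    "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
)
print(polyg_quality_report(demo))
```

If the reported means are low, the poly-G reads are likely empty clusters; if they are high, quality trimming alone will not remove them.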


The first point may be correct. The second (low quality) may not hold, as per this article: https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/. Citing the same article, the cutadapt developers offer a possible solution for two-color chemistry with overrepresented Gs in the section "Quality trimming of reads using two-color chemistry (NextSeq)", possibly by ignoring the quality values of G bases. OP needs to check the index for sequences with high G counts, and also the position of the G repeats (probably the read ends).
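
To illustrate the idea of quality trimming that ignores the reported quality of G bases, here is a rough sketch. It is not cutadapt's code: it is a running-sum 3' quality trimmer in which every G is treated as if its quality were just below the cutoff, and the function name, the default cutoff of 20, and the Phred+33 offset are assumptions made for the example.

```python
def nextseq_trim_index(seq, qual, cutoff=20, offset=33):
    """Return the index at which to cut the 3' end of a read.

    Works like running-sum quality trimming, except that every G is
    treated as having quality cutoff - 1, whatever score it reports.
    """
    running = 0
    best = 0
    cut = len(seq)
    # Walk from the 3' end towards the 5' end.
    for i in range(len(seq) - 1, -1, -1):
        q = ord(qual[i]) - offset
        if seq[i] == "G":
            q = cutoff - 1  # ignore the reported quality of G
        running += cutoff - q
        if running < 0:
            break  # good-quality non-G bases outweigh what came before
        if running > best:
            best = running
            cut = i
    return cut


# Toy read: real sequence followed by a high-quality poly-G tail.
seq = "ACGTACGTAC" + "G" * 10
qual = "I" * len(seq)
print(seq[:nextseq_trim_index(seq, qual)])
```

Note that a read consisting only of Gs is trimmed to length zero by this approach, so a minimum-length filter is still needed afterwards.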


If these are mixed reads (where there is some real sequence followed by nothing) then that will be true. My assumption based on example posted by OP is that these are empty clusters producing no signal.

bbduk.sh from BBMap suite also offers options to deal with the polyG's.

trimpolygleft=0     If greater than 0, trim poly-G prefixes of at least this
                    length on the left end of reads.  Does not trim poly-C.
trimpolygright=0    If greater than 0, trim poly-G tails of at least this
                    length on the right end of reads.  Does not trim poly-C.
trimpolyg=0         This sets both left and right at once.
filterpolyg=0       If greater than 0, remove reads with a poly-G prefix of
                    at least this length (on the left).
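
The behavior described by these options can be sketched in a few lines of Python. This is only an illustration of the idea (exact runs of G, no mismatches tolerated), not BBDuk's implementation; the function names and the default length of 10 are made up for the example.

```python
def trim_polyg_right(seq, qual, min_len=10):
    """Remove a trailing run of G if it is at least min_len bases long."""
    stripped = seq.rstrip("G")
    run = len(seq) - len(stripped)
    if min_len > 0 and run >= min_len:
        return stripped, qual[:len(stripped)]
    return seq, qual


def trim_polyg_left(seq, qual, min_len=10):
    """Remove a leading run of G if it is at least min_len bases long."""
    stripped = seq.lstrip("G")
    run = len(seq) - len(stripped)
    if min_len > 0 and run >= min_len:
        return stripped, qual[run:]
    return seq, qual


def has_polyg_prefix(seq, min_len=10):
    """True if the read starts with at least min_len Gs (candidate for removal)."""
    return min_len > 0 and len(seq) - len(seq.lstrip("G")) >= min_len


# Toy read with a 12-base poly-G tail.
print(trim_polyg_right("ACGT" + "G" * 12, "I" * 16))
```

A read that is entirely G is reduced to an empty sequence by the trimming functions, which is why filtering such reads out is often the better choice.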

My understanding is that, whether these are mixed reads or isolated reads (reads with runs of G), the Gs are assigned high quality, especially in the tails of these reads. I hope OP posts the Q scores for these reads. In general (in my experience), most users apply a quality cutoff of 20 or 30 along with adapter trimming, and at that point these reads should go away, assuming they have low Q scores. If one doesn't supply a q value, the default q cutoff for cutadapt should filter the low-quality reads (e.g., fastp's default q cutoff is 15). The QCFail site shows that average Q values for first index reads with G runs are typically high.

3.1 years ago
plberry ▴ 30

CutAdapt by default leaves the sequence IDs and blank lines for the base-call/quality scores in the output, even if all of the bases have been removed. You can alter this behavior by using -m [minimum number of bases] when you run CutAdapt. https://cutadapt.readthedocs.io/en/stable/guide.html#filtering
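
This would explain the empty overrepresented sequence in the question: reads trimmed to zero length are presumably still present in the output. The sketch below shows the effect of a minimum-length filter on in-memory records. It mimics what `-m` does conceptually and is not cutadapt's code; the function names and the toy records are hypothetical.

```python
def count_empty(records):
    """Count (header, seq, qual) records whose sequence is empty."""
    return sum(1 for _, seq, _ in records if not seq)


def filter_min_length(records, min_length=1):
    """Keep only records with at least min_length bases."""
    return [rec for rec in records if len(rec[1]) >= min_length]


# Toy records after trimming; the second read was trimmed to nothing.
records = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "", ""),
    ("@r3", "GG", "II"),
]
print(count_empty(records))
print([header for header, _, _ in filter_min_length(records, min_length=1)])
```

In practice a larger minimum (for example 20) is common, since very short reads rarely align uniquely.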


There was some back-and-forth discussion on this, and the final implementation is here: https://github.com/marcelm/cutadapt/issues/428. You are probably referring to the issue here: "Error while using the Cutadapt 2.6 ouput fastq file as input for alignment Rsubread".
