Question: k-mer overrepresentation of WGS Illumina reads
Asked 3.3 years ago by Anand Rao (United States):

I think I have obtained some unusual k-mer patterns after using Trimmomatic to trim adapters and low-quality bases. I would like to know whether forum members also find these patterns unusual, or whether they are not much of a problem at all. In particular, may I proceed to the next step of my analysis pipeline, error correction for singleton k-mers using ALLPATHS-LG? Additional information is below, please read on:


Post-Trimmomatic FastQC report on k-mer overrepresentation, with the Trimmomatic syntax used:

java -jar trimmomatic-0.33.jar PE -threads 6 -phred33 -basein EthFoc-11.S285_L007.1.txt -baseout EthFoc-11_S285_L007_trimmomatic5 -trimlog EthFoc-11_trimmomatic5.log ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true LEADING:3 TRAILING:3

After Trimmomatic, there is k-mer over-representation, as shown here. Because I thought this might go away if I clipped the first 10 nt, I performed another Trimmomatic run, this time with HEADCROP:10, which removes 10 nt from the 5' end of every read irrespective of the other downstream parameters in the syntax.

Post-Trimmomatic FastQC report on k-mer overrepresentation, with the modified Trimmomatic syntax:

java -jar trimmomatic-0.33.jar PE -threads 6 -phred33 -basein EthFoc-11.S285_L007.1.txt -baseout EthFoc-11_S285_L007_trimmomatic6 -trimlog EthFoc-11_trimmomatic6.log ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true HEADCROP:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:24 MINLEN:54

But this does not seem to fix anything; it only makes the result look worse than before. Am I misinterpreting these FastQC results?

NEXTflex-96 DNA Barcodes from Bioo Scientific were used for multiplexing. The instrument was a HiSeq 4000, and the NGS library was prepared with the Kapa HyperPlus Library Preparation Kit. My understanding is that the sequencing center had already de-multiplexed the Illumina files that I am now working with.

Here are a couple of links where such a problem is discussed on Biostars:

How to remove kmer profiles? - but that is for RNA-Seq, so it is not directly relevant to my WGS samples.
fastqc: kmer-content failed - the k-mer overrepresentation looks quite similar to mine, and one source of it is listed as random priming.

Again, mine is a WGS sample, not an RNA-Seq sample. Importantly, I think there is no random priming and no PCR step during Kapa HyperPlus library preparation. So I am puzzled as to why k-mers are overrepresented the way they are. In the second post linked above, I did notice that the scale is very different (18), but even in my analyses log2(Obs/Exp) gets worse, from ~5 the first time to ~8...
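For readers unsure what a log2(Obs/Exp) value of ~5 versus ~8 means, here is a minimal Python sketch of a positional k-mer enrichment score. It is not FastQC's actual implementation; the function name `positional_kmer_enrichment` and the choice of expected value (each k-mer spread evenly over all read positions) are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def positional_kmer_enrichment(reads, k=7):
    """Return {kmer: (position, log2(obs/exp))} for the most enriched position.

    Assumption (not FastQC's exact model): the expected count of a k-mer at
    a position is its total count, shared out in proportion to how many
    k-mer windows were observed at that position.
    """
    per_pos = defaultdict(Counter)  # kmer -> Counter of start positions
    windows = Counter()             # start position -> number of windows seen
    for read in reads:
        for i in range(len(read) - k + 1):
            per_pos[read[i:i + k]][i] += 1
            windows[i] += 1
    total_windows = sum(windows.values())

    result = {}
    for kmer, counts in per_pos.items():
        total = sum(counts.values())
        best = None
        for pos, obs in counts.items():
            exp = total * windows[pos] / total_windows
            score = math.log2(obs / exp)
            # keep the first position reaching the highest score
            if best is None or score > best[1]:
                best = (pos, score)
        result[kmer] = best
    return result


# Toy example: every read starts with ACG, so ACG is enriched at position 0.
toy_reads = ["ACGTTT", "ACGGGG", "ACGCCC", "ACGAAA"]
scores = positional_kmer_enrichment(toy_reads, k=3)
print(scores["ACG"])
```

In the toy example the 3-mer ACG is seen 4 times at position 0 where 1 would be expected, giving log2(4) = 2. A score of 5 therefore means roughly 32-fold enrichment at one position and 8 means roughly 256-fold. The score can also rise simply because trimming changes the read-length distribution and hence the expected counts.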

Since I am new to NGS Illumina sequence QC, any & all advice, education, suggestions are welcome.


If this is plain genome sequencing/re-sequencing you should not get hung up on k-mer over-representation. As long as you have assured yourself that the data is clean of adapters/extraneous sequence go on to the next step in analysis.
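One rough way to check that the data is clean of adapters is to count how many trimmed reads still contain adapter sequence. The Python sketch below assumes the common Illumina TruSeq read-through prefix AGATCGGAAGAGC is the relevant adapter; whether that applies to the NEXTflex barcodes used here should be checked against the kit documentation. The function name `adapter_fraction` and the `min_len` cut-off are hypothetical choices for illustration.

```python
def adapter_fraction(reads, adapter="AGATCGGAAGAGC", min_len=8):
    """Fraction of reads that contain the adapter, or that end with a
    prefix of it at least min_len bases long (partial 3' read-through).

    `reads` is any iterable of sequence strings; the default adapter is an
    assumption and should be replaced by the sequence for your own kit.
    """
    hits = 0
    total = 0
    for read in reads:
        total += 1
        if adapter in read:
            hits += 1
            continue
        # look for a truncated adapter at the 3' end, longest prefix first
        for n in range(len(adapter) - 1, min_len - 1, -1):
            if read.endswith(adapter[:n]):
                hits += 1
                break
    return hits / total if total else 0.0


# Toy example: one full adapter hit, one partial 3' hit, one clean read.
toy_reads = [
    "ACGTAGATCGGAAGAGCTT",
    "ACGTACGTAGATCGGAA",
    "ACGTACGTACGT",
]
print(adapter_fraction(toy_reads))
```

A fraction close to zero on the trimmed files would support moving on to the next step. Exact string matching ignores sequencing errors, so this is only a quick sanity check and not a substitute for the adapter report from FastQC.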

Reply written 3.3 years ago by genomax