I think I have obtained some weird k-mer patterns after using Trimmomatic to trim adapters and to trim by quality score. I wonder whether other forum members also consider this weird, or whether it is not (much of) a problem at all. I'd like to know whether you think I can proceed with the next step of my analysis pipeline: error correction for singleton k-mers using ALLPATHS-LG. Additional info below, please read on:
INPUT FILE - http://imgur.com/a/qBKIg
Post-Trimmomatic FastQC report on k-mer overrepresentation - http://imgur.com/a/IMCa8
Trimmomatic Syntax:
java -jar trimmomatic-0.33.jar PE -threads 6 -phred33 -basein EthFoc-11.S285_L007.1.txt -baseout EthFoc-11_S285_L007_trimmomatic5 -trimlog EthFoc-11_trimmomatic5.log ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true LEADING:3 TRAILING:3
After Trimmomatic, there is k-mer over-representation, as shown in the report linked above. Because I thought this might go away if I clipped off the first 10 nt, I performed another Trimmomatic run, this time with HEADCROP:10, which removes 10 nt from the 5' end of every read irrespective of the other downstream parameters in the syntax.
Post-Trimmomatic FastQC report on k-mer overrepresentation - http://imgur.com/a/y7DVe
Trimmomatic modified Syntax:
java -jar trimmomatic-0.33.jar PE -threads 6 -phred33 -basein EthFoc-11.S285_L007.1.txt -baseout EthFoc-11_S285_L007_trimmomatic6 -trimlog EthFoc-11_trimmomatic6.log ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true HEADCROP:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:24 MINLEN:54
But this does not seem to fix anything; it only makes the report look worse than before. Am I interpreting these FastQC results incorrectly?
NEXTflex-96 DNA Barcodes from Bioo Scientific were used for multiplexing. The instrument was a HiSeq 4000, and the NGS library was prepared with the Kapa HyperPlus Library Preparation Kit. My understanding is that the sequencing center had already de-multiplexed the Illumina files that I am now working with.
Here are a couple of links where such a problem is discussed on Biostars.
How to remove kmer profiles? - but for RNASeq, so not directly relevant to my WGS samples
fastqc: kmer-content failed - k-mer overrepresentation looks quite similar to mine - one source of this is listed to be random priming.
Again, mine is not an RNASeq sample but a WGS sample. Importantly, I think there is no random priming and no PCR step during Kapa HyperPlus library preparation. So I am puzzled as to WHY k-mers are overrepresented the way they are. In the second post linked above, I did notice that the scale is very different (18), but even in my analyses, log2(Obs/Exp) gets worse from ~5 in the first run to ~8 in the second...
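To make the log2(Obs/Exp) axis concrete, here is a minimal Python sketch of one way to compute a positional k-mer enrichment. This is not FastQC's exact calculation; the function name `positional_kmer_enrichment` and the choice of expectation (each k-mer spread evenly over all read positions) are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def positional_kmer_enrichment(reads, k=7):
    """Return {kmer: {start_position: log2(observed / expected)}}.

    Assumption (not FastQC's exact model): the expected count of a k-mer
    at a position is its total count, shared out in proportion to how
    many reads are long enough to hold a full k-mer at that position.
    """
    observed = defaultdict(Counter)  # observed[kmer][pos] = occurrences starting at pos
    windows = Counter()              # windows[pos] = reads with a full k-mer at pos

    for read in reads:
        for pos in range(len(read) - k + 1):
            observed[read[pos:pos + k]][pos] += 1
            windows[pos] += 1

    total_windows = sum(windows.values())
    enrichment = {}
    for kmer, by_pos in observed.items():
        total = sum(by_pos.values())
        enrichment[kmer] = {
            pos: math.log2(count / (total * windows[pos] / total_windows))
            for pos, count in by_pos.items()
        }
    return enrichment


if __name__ == "__main__":
    # Toy reads: ACGT is concentrated at the 5' end (position 0).
    toy_reads = ["ACGTAAAA", "ACGTCCCC", "ACGTGGGG", "TTTTACGT"]
    for pos, value in sorted(positional_kmer_enrichment(toy_reads, k=4)["ACGT"].items()):
        print(pos, round(value, 3))
```

Because the axis is log2, a value of ~5 corresponds to roughly 32 times more occurrences than expected at that position, and ~8 to roughly 256 times.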
Since I am new to NGS Illumina sequence QC, any & all advice, education, suggestions are welcome.