I have a large set of FASTQ files from genomic DNA. I ran them through FastQC and found that the "Overrepresented sequences" and "Kmer content" modules failed. The remaining modules passed, apart from a warning in "Per tile sequence quality". This pattern was present in almost all FASTQ files (>1000 files).

The "overrepresented sequences" module pointed out the presence of TruSeq adapters and Illumina PCR Primer 1.

I ran them through Trimmomatic to remove adapters. The module "overrepresented sequences" was fixed, but "Kmer content" failed again, only this time the pattern was different. Moreover, I get a new warning for the "Per sequence GC content" module (please see linked figure).

I have read that this pattern in "Kmer content" before trimming (kmers found at the beginning of the reads) could be due to fragmentation bias.

I used the adapter file provided with Trimmomatic (TruSeq3-PE-2.fa).

These are the flags I used for Trimmomatic:

java -jar trimmomatic-0.38.jar PE -phred33 ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

I have two questions:

  • Are the "Kmer content" and "Per sequence GC content" profiles after trimming something to worry about?

  • What could explain the change in "Kmer content" after trimming?

Here you can find the FastQC reports before and after running Trimmomatic:

And here is a comparison of "kmer content" and "Per sequence GC content" before and after trimming:

Thank you very much in advance

A failed "Kmer content" or "Per sequence GC content" module in FastQC generally has no immediate adverse effect on your analysis. You should proceed with further analysis and see what you get. In recent versions of FastQC the Kmer content module is turned off by default, since it causes more heartache than necessary.
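
If you want to see how consistently each module is flagged across all >1000 files before moving on, something along these lines may help. It is only a rough sketch: it assumes the FastQC output has been unzipped so that each report directory contains its summary.txt (tab-separated status, module name, file name), and the fastqc_out/ path and the tally_statuses helper are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict


def tally_statuses(summary_lines):
    """Count PASS/WARN/FAIL per FastQC module from summary.txt lines."""
    counts = defaultdict(Counter)
    for line in summary_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, module = parts[0], parts[1]
        counts[module][status] += 1
    return counts


if __name__ == "__main__":
    import glob

    # Hypothetical layout: one unzipped FastQC directory per FASTQ file.
    lines = []
    for path in glob.glob("fastqc_out/*_fastqc/summary.txt"):
        with open(path) as handle:
            lines.extend(handle)

    # Print one line per module with its status counts.
    for module, statuses in sorted(tally_statuses(lines).items()):
        print(module, dict(statuses))
```

Running it once on the pre-trimming reports and once on the post-trimming reports gives a per-module tally you can compare side by side.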

modified 8 months ago • written 8 months ago by genomax
