Hi, apologies for this basic question as I am new to the field. I have been checking my NGS data using FastQC and the checks fail on the Kmer content section. There appears to be a sequence TAGATCGGAA at position 90-100 bp in the reads which is enriched around 12 fold (Obs/Exp Max). However, nothing shows up in the 'Overrepresented sequences' or 'Adapter content' sections of the report (completely flat line). The sequencing was done at BGI and I do not know what primers were used. My question is: should I be trying to remove this Kmer sequence?
If I use grep on the first million reads in my FASTQ file to look for the sequence, I only find 36 matches, which seems like quite a low number:
$ gunzip -c data1.fq.gz | head -4000000 | grep TAGATCGGAA | wc -l
Thanks for the help
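One caveat about the grep check above: it scans all four lines of every FASTQ record, so a match in a header line would be counted too, and it says nothing about where in the read the k-mer sits. Below is a minimal sketch that restricts the search to the sequence lines and tabulates the start position of the first match in each read. It assumes standard four-line FASTQ records (no wrapped sequences), and the function names `count_kmer_reads` and `kmer_start_positions` are made up for this illustration.

```shell
# Count reads whose sequence line (line 2 of each 4-line record) contains the k-mer.
# Assumes unwrapped, 4-line FASTQ records on stdin.
count_kmer_reads() {
  awk 'NR % 4 == 2' | grep -c "$1"
}

# Tabulate the 1-based start position of the first k-mer match in each read.
# Output columns: number of reads, start position.
kmer_start_positions() {
  awk -v kmer="$1" 'NR % 4 == 2 { p = index($0, kmer); if (p > 0) print p }' |
    sort -n | uniq -c
}

# Example usage on the file from the question:
#   gunzip -c data1.fq.gz | head -4000000 | count_kmer_reads TAGATCGGAA
#   gunzip -c data1.fq.gz | head -4000000 | kmer_start_positions TAGATCGGAA
```

If most of the hits start near position 90, that would be consistent with the positional enrichment FastQC reports, even when the total number of affected reads is small.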
This plot in general, and that fold change and p-value in particular, are not informative and in the vast majority of cases cause only confusion.
This plot should be removed from the output of this tool.
Agreed, I really wish that FastQC had big warnings above their plots like "this is only meaningful for whole-genome sequencing".
Hi @kezcleal, I see a very similar kmer plot with the exact same kmers showing up as over-represented. Were you able to find the cause of these? My data is Agilent exomes, so I am wondering if there could be adapters/baits/control sequences in the target enrichment kit that are popping up. Thanks!