Hi, apologies for the basic question; I am new to the field. I have been checking my NGS data with FastQC, and it fails the Kmer Content check. The sequence TAGATCGGAA appears at positions 90-100 bp in the reads and is enriched about 12-fold (Obs/Exp Max). However, nothing shows up in the 'Overrepresented sequences' or 'Adapter content' sections of the report (completely flat line). The sequencing was done at BGI and I do not know which primers were used. My question is: should I be trying to remove this k-mer sequence?
If I use grep on the first million reads in my fastq file to look for the sequence, I find only 36 matches, which seems quite a low number.
$ gunzip -c data1.fq.gz | head -4000000 | grep TAGATCGGAA | wc -l
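A slightly stricter variant of the count above, as a sketch only: plain grep scans all four lines of each FASTQ record, so this version looks only at the sequence lines and tallies the position at which the k-mer first starts in each read. The function name count_kmer_positions is made up for illustration, and the code assumes standard four-line FASTQ records.

```shell
# Hypothetical helper: read FASTQ on stdin, print "start_position count"
# for the first occurrence of the k-mer in each read (1-based positions).
count_kmer_positions() {
  awk -v kmer="$1" '
    NR % 4 == 2 {                 # sequence line of each 4-line record
      p = index($0, kmer)         # 0 if the k-mer is absent
      if (p > 0) hits[p]++
    }
    END { for (p in hits) print p, hits[p] }
  ' | sort -n
}
```

It could be used the same way as the original command, e.g. `gunzip -c data1.fq.gz | head -4000000 | count_kmer_positions TAGATCGGAA`. If the hits pile up near the 3' end of the reads, that would be consistent with what the FastQC plot reports.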
Thanks for the help
This plot in general, and the fold change and p-value in particular, are not informative and in the vast majority of cases cause only confusion.
This plot should be removed from the output of this tool.
Agreed, I really wish that FastQC had big warnings above their plots like "this is only meaningful for whole-genome sequencing".
I have ChIP-seq data with a similar problem. The quality scores are good, so I aligned the reads with bowtie and called peaks with macs2. Only when I use this particular sample (the input) do I get no peaks for my IP samples; using a different input yields peaks for all IP samples. I would appreciate any feedback.
Try doing a PCA or clustering of the various input and ChIP samples (e.g., with deepTools); perhaps you have a sample swap.
I suspect that the outlier sample has signal mostly at regions that should be blacklisted. Either way, you have one weird sample that might need to be excluded; that happens.
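A rough sketch of the PCA/clustering check suggested above, using deepTools. The function name, output prefix and BAM file names are placeholders rather than anything from this thread, and the BAM files are assumed to be sorted and indexed.

```shell
# Hypothetical helper: PCA and correlation heatmap across BAM files with deepTools.
# Usage: deeptools_sample_check OUT_PREFIX sample1.bam sample2.bam ...
deeptools_sample_check() {
  prefix="$1"
  shift
  # Count reads in genome-wide bins for every BAM file given.
  multiBamSummary bins --bamfiles "$@" -o "${prefix}.npz" &&
  # PCA of the binned counts; a swapped or odd sample should sit apart.
  plotPCA -in "${prefix}.npz" -o "${prefix}_pca.png" &&
  # Clustered correlation heatmap as a second view of the same counts.
  plotCorrelation -in "${prefix}.npz" --corMethod spearman \
    --whatToPlot heatmap -o "${prefix}_heatmap.png"
}
```

For example, `deeptools_sample_check chip_qc input1.bam input2.bam ip1.bam ip2.bam` (file names made up). Inputs would normally be expected to cluster together and away from the IP samples.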
I see a very similar k-mer plot, with the exact same k-mers showing up as over-represented. I am wondering if you were able to find the cause. My data is Agilent exomes, so I am wondering whether adapters/baits/control sequences in the target enrichment kit could be popping up.