Question: Kmer Content in FastQC failed
2.1 years ago by
hurtc.stri10 wrote:

Hi All,

I am completely new to NGS analysis. I just received data from a paired-end 125 bp Illumina run. I ran both data files through FastQC to check data quality and received a few warnings/failures. Both runs failed the Kmer Content test. In the graph, all of the lines hover around zero until you get to the right side (near 94-96), where all of the lines increase exponentially to about 9. Does this mean that there could be adaptors left over on the 5' end? I'm not sure exactly how to interpret this or what I should do about it. Thank you in advance for any advice/suggestions.



sequencing next-gen • 9.5k views
modified 16 months ago by alex.rubinsteyn120 • written 2.1 years ago by hurtc.stri10
2.1 years ago by
Amitm1.4k wrote:


Do not worry too much about the k-mer plot. Check the following instead:

1) The per base sequence quality plot is OK, with most of the boxes in the green zone and maybe a couple towards the end falling below it.

2) The per base sequence content plot has the 4 lines overlapping each other. Sometimes the lines are noisy for the first ~10 bases or so, but from there on they should smooth out.

3) The per sequence GC content plot has a single bell-shaped hump (more or less) and not two or more. Small shoulders are OK.

4) If you see adapters in the Adapter Content plot, do adapter removal (and maybe trim reads from the end as well to get rid of low-quality bases) and then redo FastQC.

If the three plots above look as described and any adapters have been removed, you are good to go.

I do not know for what reasons k-mers show enrichment, but if it is towards the end of the read length then it could be due to low-quality bases or adapters.

Caveat - The above generalizations are for WES/WGS/RNA-seq data. If you have low-volume data, such as from amplicon sequencing, FastQC will report a lot of noise. For example, the GC plot might have multiple shoulders, but this would be due to low diversity in your data (a handful of genes).
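As a rough illustration of the checklist above, here is a minimal Python sketch that reads the text of the `summary.txt` file FastQC writes next to each report and lists which of the four modules mentioned did not pass. It assumes the usual tab-separated layout (status, module name, file name) and the module names as they appear in recent FastQC versions; the helper names and the sample file name `sample_R1.fastq.gz` are made up for this example.

```python
# Assumption: module names as written by recent FastQC versions in summary.txt.
KEY_MODULES = (
    "Per base sequence quality",
    "Per base sequence content",
    "Per sequence GC content",
    "Adapter Content",
)


def parse_fastqc_summary(text):
    """Map each module name to its status string (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, module = parts[0], parts[1]
        statuses[module] = status
    return statuses


def modules_needing_attention(statuses, modules=KEY_MODULES):
    """Return the key modules whose status is not PASS (absent ones included)."""
    return [m for m in modules if statuses.get(m, "MISSING") != "PASS"]


# Hypothetical summary content, loosely modelled on the run described in the question.
example_summary = (
    "PASS\tBasic Statistics\tsample_R1.fastq.gz\n"
    "PASS\tPer base sequence quality\tsample_R1.fastq.gz\n"
    "WARN\tPer base sequence content\tsample_R1.fastq.gz\n"
    "PASS\tPer sequence GC content\tsample_R1.fastq.gz\n"
    "PASS\tAdapter Content\tsample_R1.fastq.gz\n"
    "FAIL\tKmer Content\tsample_R1.fastq.gz\n"
)

if __name__ == "__main__":
    print(modules_needing_attention(parse_fastqc_summary(example_summary)))
```

A WARN or FAIL here is only a prompt to look at the corresponding plot; it does not by itself mean the data are unusable.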


Thank you so much for the tips. The per base sequence content plot did issue a warning. The A/T lines overlap at around 30% and the G/C lines overlap at around 20%. I assume this means that my genome is A/T biased?



written 2.1 years ago by hurtc.stri10

Yes, from what you are saying it seems so. You can get a better idea by looking at the 'Basic Statistics' section of the report, which gives you the overall GC% as well.
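For reference, with A and T each at about 30% and G and C each at about 20%, the overall GC content works out to roughly 40%. A small Python sketch of that arithmetic follows; the function names and the toy read are invented for illustration, and this is not how FastQC itself computes the value.

```python
from collections import Counter


def base_composition(sequences):
    """Return the percentage of A, C, G and T across all sequences (N is ignored)."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def gc_percent(composition):
    """GC% is simply the G percentage plus the C percentage."""
    return composition["G"] + composition["C"]


# Toy read with the same proportions as in the question: 30% A, 30% T, 20% G, 20% C.
toy_reads = ["AAATTTGGCC"]

if __name__ == "__main__":
    comp = base_composition(toy_reads)
    print(comp, gc_percent(comp))
```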

written 2.1 years ago by Amitm1.4k
2.1 years ago by
United States
dally150 wrote:

This has some good examples:

Then you can look into the following tools to trim the data: Trimmomatic, FASTX-Toolkit.




Great tutorial - thanks for the link.  Now onto Trimmomatic!

written 2.1 years ago by hurtc.stri10
16 months ago by
United States
alex.rubinsteyn120 wrote:

I have the same problem, also with 125 bp paired-end Illumina sequencing. The issue appears to be a shorter-than-desired fragment size distribution, leading to adapter read-through on ~10k reads.
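To make read-through concrete: when a fragment is shorter than the read length, the sequencer runs past the end of the fragment and into the adapter, so the 3' end of the read contains adapter sequence. The Python sketch below simulates that and then cuts the read at the adapter. The adapter sequence is an assumption (the commonly cited start of the Illumina TruSeq adapter), the 95-base fragment is a made-up value, and the exact-match trimming is far cruder than what dedicated trimmers do (they tolerate mismatches and partial adapters at the read end).

```python
# Assumption: commonly cited TruSeq adapter; check the adapter used in your library prep.
ADAPTER_PREFIX = "AGATCGGAAGAGC"
FULL_ADAPTER = ADAPTER_PREFIX + "ACACGTCTGAACTCCAGTCA"


def simulate_read(fragment, adapter, read_length):
    """Return the first read_length bases of fragment followed by adapter."""
    return (fragment + adapter)[:read_length]


def trim_adapter(read, adapter_prefix=ADAPTER_PREFIX):
    """Cut the read at the first exact occurrence of the adapter prefix, if any."""
    idx = read.find(adapter_prefix)
    return read if idx == -1 else read[:idx]


# Hypothetical 95-base fragment sequenced with 125-base reads.
short_fragment = "ACGT" * 23 + "ACG"
read = simulate_read(short_fragment, FULL_ADAPTER, read_length=125)
trimmed = trim_adapter(read)

if __name__ == "__main__":
    print(len(short_fragment), len(read), len(trimmed))
```

In this toy case the last 30 bases of the read are adapter, which is the kind of shared sequence at the read end that can show up as enriched k-mers.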


It happens. Use an appropriate trimming program and go on to the next step.

written 16 months ago by genomax42k

