Question

fastqc: kmer-content failed

0

Entering edit mode

7.0 years ago

shuchen ▴ 10

After running trimmomatic to trim the raw reads, the fastqc still reported kmer-content failure. What should I do with this issue? Thank you so much!

enter image description here

fastqc kmer-content fail • 6.1k views

ADD COMMENT • link 7.0 years ago by shuchen ▴ 10

0

Entering edit mode

I looked up similar issues and some have suggested that we need not worry too much about kmer-content plot. But in this data, it seems that the kmer enrichment is the most serious at the 5' end of the read. Could it be caused by adapter contamination but was not detected by fastqc because the adapter was not in the adapter list?

ADD REPLY • link 7.0 years ago by shuchen ▴ 10

0

Entering edit mode

Map and look at mismatches per position. bbmap.sh can do just that with a single command:

bbmap.sh in=file.fastq out=file.bam ref=genome.fas mhist=mismatch.txt indelhist=indels.txt

If your enriched kmers are artifacts, there will be an increase of mismatches and / or indels at the first positions.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Entering edit mode

An error occurred while running bbmap. Should I change the maximum heap size？

java -Djava.library.path=/home/shu/App/bbmap/jni/ -ea -Xmx-362m -cp /home/shu/App/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in=f.fq out=f.bam ref=GCF_000005845.2_ASM584v2_genomic.fna mhist=mismatch.txt indelhist=indels.txt Invalid maximum heap size: -Xmx-362m Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit.

ADD REPLY • link 7.0 years ago by shuchen ▴ 10

2

Entering edit mode

You set your heap to -362m, rather than 362m. The flag should have been "-Xmx362m". But, it looks like the absolute minimum amount of memory BBMap needs, for a bacterial-size genome with default parameters, is around "-Xmx800m". For a smaller kmer length of 12 you can get by with "-Xmx362m".

ADD REPLY • link 7.0 years ago by Brian Bushnell 20k

0

Entering edit mode

I aligned the reads towards an e. coli genome downloaded from NCBI using bowtie2, --very-sensitive-local, but only got 66.91% overall alignment rate. Is this alingment rate normal?

1593573 reads; of these:
1593573 (100.00%) were paired; of these:
570107 (35.78%) aligned concordantly 0 times
981390 (61.58%) aligned concordantly exactly 1 time
42076 (2.64%) aligned concordantly >1 times
----
570107 pairs aligned concordantly 0 times; of these:
34907 (6.12%) aligned discordantly 1 time
----
535200 pairs aligned 0 times concordantly or discordantly; of these:
  1070400 mates make up the pairs; of these:
    1054753 (98.54%) aligned 0 times
    10990 (1.03%) aligned exactly 1 time
    4657 (0.44%) aligned >1 times
    66.91% overall alignment rate

ADD REPLY • link 7.0 years ago by shuchen ▴ 10

0

Entering edit mode

That depends on the situation. Aligning an isolate to an isolate assembly, 66% is terrible. Aligning metagenomic reads to their assembly... it's often good.

ADD REPLY • link 7.0 years ago by Brian Bushnell 20k

0

Entering edit mode

I was using a SRA dataset which is an isolate. So what do you think might be the problem? I am not very familiar with bacteria. Is it possible that E. coli DNA sequence is very different between strains and caused this low alignment rate.

ADD REPLY • link 7.0 years ago by shuchen ▴ 10

0

Entering edit mode

Could be adapter sequence, could be contaminants, could be low quality. Hard to say. You might try BLASTing a handful of the unmapped reads to see what they hit. E.coli's tend to be very similar so it's unlikely that divergence is the problem.

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

0

Entering edit mode

I blasted the unmapped reads and it turned out a lot of them are 100% similar to ecoli genes, so I am a little confused why it did not map to the ecoli genome at the first place.

ADD REPLY • link 6.9 years ago by shuchen ▴ 10

0

Entering edit mode

Did you check the other read pair why it not mapping? May be this is adaptor contamination, may be low quality. And one of your problems is the --very-sensitive-local mode, which will bias bowtie2 to look for local instead of end-to-end match. Try running with default params.

ADD REPLY • link 6.9 years ago by Santosh Anand 5.7k

0

Entering edit mode

I don't think you need fastareadlen=500 if your data is fastq. The amount of memory you need depends on the size of the reference genome, but yes, 362Mb seems very small.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Entering edit mode

Ok. Thanks. I will try it on a computer with larger memory.

ADD REPLY • link 7.0 years ago by shuchen ▴ 10

score 3 · Accepted Answer · 2017-05-15

3

Entering edit mode

7.0 years ago

Santosh Anand 5.7k

I would not be too much worried about the k-mer content unless there are other warning flags in QC. Moreover, the 5'-end has some bias for priming https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

What is your library (RNA-seq)?

ADD COMMENT • link 7.0 years ago by Santosh Anand 5.7k

1

Entering edit mode

Thank you so much! I think this is exactly what the problem is. I was actually trying out with a downloaded SRA data. Although it was a genome sequencing but I think it used random priming and caused uneven base content at the start of the sequence. So this is actually a problem that can not be fixed by trimming.

enter image description here

ADD REPLY • link 7.0 years ago by shuchen ▴ 10

0

Entering edit mode

Happy that it is resolved. I've moved my comment to answer.

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

0

Entering edit mode

How did you resolve the problem? I have RNA seq data and fastqc reported Adapter Content pass but kmer-content failure.

ADD REPLY • link 3.4 years ago by archanaverma433 ▴ 10