Question: fastqc: kmer-content failed
0
gravatar for shuchen
19 months ago by
shuchen10
shuchen10 wrote:

After running trimmomatic to trim the raw reads, the fastqc still reported kmer-content failure. What should I do with this issue? Thank you so much!

enter image description here

enter image description here

fastqc fail kmer-content • 2.2k views
ADD COMMENTlink modified 19 months ago • written 19 months ago by shuchen10

I looked up similar issues and some have suggested that we need not worry too much about kmer-content plot. But in this data, it seems that the kmer enrichment is the most serious at the 5' end of the read. Could it be caused by adapter contamination but was not detected by fastqc because the adapter was not in the adapter list?

ADD REPLYlink modified 19 months ago • written 19 months ago by shuchen10

Map and look at mismatches per position. bbmap.sh can do just that with a single command:

bbmap.sh in=file.fastq out=file.bam ref=genome.fas mhist=mismatch.txt indelhist=indels.txt

If your enriched kmers are artifacts, there will be an increase of mismatches and / or indels at the first positions.

ADD REPLYlink written 19 months ago by h.mon22k

An error occurred while running bbmap. Should I change the maximum heap size?

java -Djava.library.path=/home/shu/App/bbmap/jni/ -ea -Xmx-362m -cp /home/shu/App/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in=f.fq out=f.bam ref=GCF_000005845.2_ASM584v2_genomic.fna mhist=mismatch.txt indelhist=indels.txt Invalid maximum heap size: -Xmx-362m Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit.

ADD REPLYlink modified 19 months ago • written 19 months ago by shuchen10
2

You set your heap to -362m, rather than 362m. The flag should have been "-Xmx362m". But, it looks like the absolute minimum amount of memory BBMap needs, for a bacterial-size genome with default parameters, is around "-Xmx800m". For a smaller kmer length of 12 you can get by with "-Xmx362m".

ADD REPLYlink written 19 months ago by Brian Bushnell16k

I aligned the reads towards an e. coli genome downloaded from NCBI using bowtie2, --very-sensitive-local, but only got 66.91% overall alignment rate. Is this alingment rate normal?

1593573 reads; of these:
1593573 (100.00%) were paired; of these:
570107 (35.78%) aligned concordantly 0 times
981390 (61.58%) aligned concordantly exactly 1 time
42076 (2.64%) aligned concordantly >1 times
----
570107 pairs aligned concordantly 0 times; of these:
34907 (6.12%) aligned discordantly 1 time
----
535200 pairs aligned 0 times concordantly or discordantly; of these:
  1070400 mates make up the pairs; of these:
    1054753 (98.54%) aligned 0 times
    10990 (1.03%) aligned exactly 1 time
    4657 (0.44%) aligned >1 times
    66.91% overall alignment rate
ADD REPLYlink modified 19 months ago • written 19 months ago by shuchen10

That depends on the situation. Aligning an isolate to an isolate assembly, 66% is terrible. Aligning metagenomic reads to their assembly... it's often good.

ADD REPLYlink modified 19 months ago • written 19 months ago by Brian Bushnell16k

I was using a SRA dataset which is an isolate. So what do you think might be the problem? I am not very familiar with bacteria. Is it possible that E. coli DNA sequence is very different between strains and caused this low alignment rate.

ADD REPLYlink modified 19 months ago • written 19 months ago by shuchen10

Could be adapter sequence, could be contaminants, could be low quality. Hard to say. You might try BLASTing a handful of the unmapped reads to see what they hit. E.coli's tend to be very similar so it's unlikely that divergence is the problem.

ADD REPLYlink written 19 months ago by Brian Bushnell16k

I blasted the unmapped reads and it turned out a lot of them are 100% similar to ecoli genes, so I am a little confused why it did not map to the ecoli genome at the first place.

ADD REPLYlink written 19 months ago by shuchen10

Did you check the other read pair why it not mapping? May be this is adaptor contamination, may be low quality. And one of your problems is the --very-sensitive-local mode, which will bias bowtie2 to look for local instead of end-to-end match. Try running with default params.

ADD REPLYlink written 19 months ago by Santosh Anand4.3k

I don't think you need fastareadlen=500 if your data is fastq. The amount of memory you need depends on the size of the reference genome, but yes, 362Mb seems very small.

ADD REPLYlink written 19 months ago by h.mon22k

Ok. Thanks. I will try it on a computer with larger memory.

ADD REPLYlink written 19 months ago by shuchen10
3
gravatar for Santosh Anand
19 months ago by
Santosh Anand4.3k
Santosh Anand4.3k wrote:

I would not be too much worried about the k-mer content unless there are other warning flags in QC. Moreover, the 5'-end has some bias for priming https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

What is your library (RNA-seq)?

ADD COMMENTlink written 19 months ago by Santosh Anand4.3k
1

Thank you so much! I think this is exactly what the problem is. I was actually trying out with a downloaded SRA data. Although it was a genome sequencing but I think it used random priming and caused uneven base content at the start of the sequence. So this is actually a problem that can not be fixed by trimming.

enter image description here

enter image description here

ADD REPLYlink written 19 months ago by shuchen10

Happy that it is resolved. I've moved my comment to answer.

ADD REPLYlink modified 19 months ago • written 19 months ago by Santosh Anand4.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1397 users visited in the last hour