kmer distribution tiny peak at low coverage
4.2 years ago
deepti1rao ▴ 40

I generated a kmer count file using Jellyfish and then a histogram which, when plotted in R, gave the attached graph. I am confused about why there is a small peak at coverage 22. I see a similar tiny peak even for kmer values as high as 115.

1. How does one interpret this for a genome expected to have 50-60% repeats?
2. How can I extract reads pertaining to the tiny peaks?
3. I suspect that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by FastQC.
4. Can I safely interpret this tiny peak as a non-erroneous peak and retain those kmers for assembly?
kmer distribution Assembly • 2.3k views

Be more careful adding images please: the URL you use must point directly to the image. For example, you used: https://ibb.co/gFwTZc where you should have used: https://image.ibb.co/kNrHSx/per_sequence_gc_content.png

Right click on the image in the page (https://ibb.co/gFwTZc) and select Copy Image Address to get the actual image URL.


Better option is to click on the embed code tab at the bottom of the page and then copy full image HTML link and paste in the post (like below).

4.2 years ago
h.mon 34k

How can I extract reads pertaining to the tiny peaks?

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=40 highbindepth=110


I suspect that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by FastQC.

If by "correlate this" you mean the different kmer peaks, run FastQC again after splitting the reads into coverage bins. Then you can check each bin's GC content.
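As a quick cross-check alongside FastQC, the mean GC content of each bin can also be computed directly. The following is a minimal sketch, assuming plain (uncompressed) FASTQ with exactly four lines per record; the function names are made up for illustration and the demo record stands in for a real file such as low.fq.

```python
import io
from statistics import mean

def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); N is ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0

def read_gc_values(handle):
    """Yield the GC fraction of each read in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            yield gc_fraction(line.strip())

def mean_gc(handle):
    """Mean per-read GC fraction, or 0.0 for an empty stream."""
    values = list(read_gc_values(handle))
    return mean(values) if values else 0.0

# Tiny in-memory demo; for real data, pass open("low.fq") and so on instead.
demo = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\nIIII\n")
print(mean_gc(demo))
```

Running this once per output file (low, mid, high) gives one number per coverage bin that can be compared directly.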


Thanks for that. I also want to extract the low-coverage kmer sequences that gave rise to the tiny peak in the kmer distribution (attached). I want to check whether they have a higher GC content.
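One way to approach this is to dump the kmer counts as text and filter them by count. The sketch below assumes a two-column "KMER COUNT" text file, which is the layout I understand `jellyfish dump -c` to write; the bounds 15 and 30 around the peak at 22 are hypothetical and should be read off the actual histogram. The function names are illustrative only.

```python
def gc_fraction(kmer):
    """Fraction of G/C bases in a kmer."""
    kmer = kmer.upper()
    return (kmer.count("G") + kmer.count("C")) / len(kmer) if kmer else 0.0

def kmers_in_count_range(lines, low, high):
    """Yield (kmer, count) from 'KMER COUNT' lines with low <= count <= high."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields[0], int(fields[1])
        if low <= count <= high:
            yield kmer, count

def mean_gc_of_peak(lines, low, high):
    """Mean GC fraction of the kmers in the count range, or None if there are none."""
    values = [gc_fraction(k) for k, _ in kmers_in_count_range(lines, low, high)]
    return sum(values) / len(values) if values else None

# Toy input; for real data, iterate over the lines of the dumped text file.
demo = ["ACGTACGT 22", "GGGGCCCC 21", "ATATATAT 90", "GCGCGCAT 3"]
print(list(kmers_in_count_range(demo, 15, 30)))
print(mean_gc_of_peak(demo, 15, 30))
```

Comparing this value with the mean GC of the kmers under the main peak would show whether the tiny peak is GC-shifted.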


Would the coverage bins here split reads based on read coverage or on kmer coverage?


Kmer coverage over the length of a read, so it correlates highly with read coverage.

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers.
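The two-pass idea described above can be sketched in a few lines. This is not BBNorm's actual implementation: it ignores reverse complements, keeps all counts in memory, and uses min(1, target / median) as one plausible reading of "probability based on the ratio". All names here are hypothetical.

```python
import random
from collections import Counter
from statistics import median

def kmers(seq, k):
    """All overlapping kmers of a sequence (no reverse-complement handling)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_kmers(reads, k):
    """Pass 1: count every kmer across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def read_depth(read, counts, k):
    """Median count of the read's kmers; 0 if the read is shorter than k."""
    values = [counts[km] for km in kmers(read, k)]
    return median(values) if values else 0

def keep_probability(depth, target):
    """Keep reads at or below the target depth; otherwise keep with target/depth."""
    return 1.0 if depth <= target else target / depth

def normalize(reads, k, target, rng=None):
    """Pass 2: keep each read with a probability set by its median kmer depth."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    reads = list(reads)
    counts = count_kmers(reads, k)
    return [r for r in reads
            if rng.random() < keep_probability(read_depth(r, counts, k), target)]

# Ten copies of one read plus one unique read: the unique read is always kept.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
print(normalize(reads, k=4, target=5))
```

The need to count everything before deciding on any read is what makes a second pass over the input necessary, which is consistent with the guide's note that piped input cannot be used.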


I just need to know for sure that this tiny peak is genuine data and not erroneous, so that I can include those kmers when I set a kmer coverage cutoff for genome assembly.