Question: kmer distribution tiny peak at low coverage
deepti1rao wrote:

jellyfish.histo file
fastqc file

I generated a kmer count file using jellyfish and subsequently a histogram, which when plotted in R gave the attached graph. I am confused about why I have a small peak at coverage 22. I see a similar tiny peak even for kmer values as high as 115.

  1. How does one interpret this for a genome expected to have 50-60% repeats?
  2. How can I extract reads pertaining to the tiny peaks?
  3. I suspect this correlates with higher GC content in some reads, as you can see in the attached file generated by fastqc.
  4. Can I safely interpret this tiny peak as a non-erroneous peak and retain those kmers for assembly?
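
To locate small peaks like the one described at coverage 22 without reading them off the plot, the histogram can be scanned for local maxima. The sketch below assumes the usual two-column output of jellyfish histo (kmer multiplicity, then number of distinct kmers); the function names, the window size, and the example numbers are illustrative only and are not taken from the attached file.

```python
def read_histo(lines):
    """Parse Jellyfish histo lines of the form '<multiplicity> <distinct kmers>'."""
    histo = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            histo.append((int(parts[0]), int(parts[1])))
    return histo


def local_peaks(histo, window=2):
    """Return multiplicities whose bin is a strict maximum within +/- window bins."""
    peaks = []
    for i in range(window, len(histo) - window):
        centre = histo[i][1]
        neighbours = [histo[j][1]
                      for j in range(i - window, i + window + 1) if j != i]
        if all(centre > n for n in neighbours):
            peaks.append(histo[i][0])
    return peaks


# Made-up histogram: error kmers at multiplicity 1-2, a small bump at 5,
# and a larger peak at 9.
example = ["1 900", "2 400", "3 100", "4 120", "5 150", "6 130",
           "7 90", "8 200", "9 350", "10 300", "11 150"]
peaks = local_peaks(read_histo(example))
print(peaks)
```

On real data a larger window (or some smoothing) is probably needed, because bin-to-bin noise creates spurious maxima.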
distribution kmer assembly • 608 views
modified 12 months ago by h.mon • written 12 months ago by deepti1rao

Be more careful adding images please: the URL you use must point directly to the image. For example, you used: https://ibb.co/gFwTZc where you should have used: https://image.ibb.co/kNrHSx/per_sequence_gc_content.png

Right click on the image in the page (https://ibb.co/gFwTZc) and select Copy Image Address to get the actual image URL.

written 12 months ago by RamRS

A better option is to click the embed code tab at the bottom of the page, then copy the full image HTML link and paste it into the post (like below).

per sequence gc content

written 12 months ago by genomax
h.mon wrote:

How can I extract reads pertaining to the tiny peaks?

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=40 highbindepth=110

I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.

If by "correlate this" you mean the different kmer peaks, run FastQC again after splitting the reads into coverage bins. Then you can check the GC content of each bin.
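
As a quick cross-check alongside FastQC, the mean GC content of each bin can also be computed directly. This is a minimal sketch that assumes uncompressed four-line FASTQ records; low.fq, mid.fq and high.fq are the output names from the bbnorm.sh command above, and the function names are hypothetical.

```python
import io


def read_gc_fractions(handle):
    """Return the GC fraction of every read in a four-line-per-record FASTQ."""
    fractions = []
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record holds the sequence
            seq = line.strip().upper()
            if seq:
                gc = seq.count("G") + seq.count("C")
                fractions.append(gc / len(seq))
    return fractions


def mean_gc(path):
    """Mean per-read GC fraction of a FASTQ file, or None if it has no reads."""
    with open(path) as handle:
        values = read_gc_fractions(handle)
    return sum(values) / len(values) if values else None


# Tiny in-memory example standing in for one of the bin files.
demo = read_gc_fractions(io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nATGC\n+\nIIII\n"))
print(demo)
```

Calling mean_gc("low.fq"), mean_gc("mid.fq") and mean_gc("high.fq") would then give one number per coverage bin to compare.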

modified 12 months ago • written 12 months ago by h.mon

Thanks for that. I also want to extract the kmer sequences of low coverage, which have given rise to the tiny peak in the kmer distribution (attached). I want to check if they have a higher GC content.
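
One possible way to get at the kmers themselves, sketched under assumptions: jellyfish dump can restrict output by count with -L (lower) and -U (upper) and write "kmer count" columns with -c, for example jellyfish dump -c -L 15 -U 30 mer_counts.jf > low_peak_kmers.txt. The bounds 15 and 30 and both file names are placeholders chosen around the peak at 22, not values from this thread. The GC content of the dumped kmers could then be computed like this:

```python
def kmer_gc_in_range(lines, low, high):
    """GC fraction over distinct kmers whose count lies in [low, high].

    Expects lines in 'KMER COUNT' column format. Each distinct kmer is
    weighted once, not by its count. Returns None if no kmer qualifies.
    """
    total_gc = 0
    total_bases = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        kmer, count = parts[0].upper(), int(parts[1])
        if low <= count <= high:
            total_gc += kmer.count("G") + kmer.count("C")
            total_bases += len(kmer)
    return total_gc / total_bases if total_bases else None


# Made-up 4-mers for illustration; real kmers would be much longer.
demo_lines = ["GGCC 22", "ATAT 3", "GCAT 25", "AAAA 200"]
demo_gc = kmer_gc_in_range(demo_lines, 15, 30)
print(demo_gc)
```

Running the same function over a second count range (the main peak) gives a baseline to compare the tiny peak against.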

written 12 months ago by deepti1rao

Would the coverage bins here split reads based on read coverage or on kmer coverage?

written 12 months ago by deepti1rao

kmer coverage over the length of a read - so it correlates highly with read coverage.

http://seqanswers.com/forums/showthread.php?t=49763

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers.
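
The last sentence of that description can be illustrated with a small sketch. This is not BBNorm's actual code: it only shows the stated idea of keeping a read with probability equal to the ratio of the target depth to the median count of the read's kmers, capped at 1. The function names and the example counts are assumptions.

```python
import random
import statistics


def keep_probability(kmer_counts, target_depth):
    """Probability of keeping a read, given the counts of its kmers."""
    median = statistics.median(kmer_counts)
    if median <= 0:
        return 1.0
    # Reads at or below the target depth are always kept.
    return min(1.0, target_depth / median)


def keep_read(kmer_counts, target_depth, rng=random):
    """Randomly decide whether to keep a read (illustrative only)."""
    return rng.random() < keep_probability(kmer_counts, target_depth)


# A read whose kmers have a median count of 200 is kept about half the time
# at target depth 100; a read with median count 20 is always kept.
p_high = keep_probability([190, 200, 210], 100)
p_low = keep_probability([10, 20, 30], 100)
print(p_high, p_low)
```

Because the decision uses the median over the whole read, a read is binned or discarded as a unit, which is why kmer depth and read depth track each other closely here.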

written 12 months ago by h.mon

I just need to know for sure that this tiny peak is genuine data and not erroneous, so that I can include those kmers when I set a kmer coverage cutoff for genome assembly.

written 12 months ago by deepti1rao