Question: kmer distribution tiny peak at low coverage
deepti1rao (20) wrote 16 months ago:

jellyfish.histo file
fastqc file

I generated a kmer count file using jellyfish and then a histogram, which gave the attached graph when plotted in R. I am confused about why there is a small peak at coverage 22. I see a similar tiny peak even for kmer values as high as 115.

  1. How does one interpret this for a genome expected to have 50-60% repeats?
  2. How can I extract reads pertaining to the tiny peaks?
  3. I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.
  4. Can I safely interpret this tiny peak as a non-erroneous peak and retain those kmers for assembly?
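
To pin down where the peaks in the histogram sit, rather than reading them off the plot, here is a minimal Python sketch. It assumes the jellyfish.histo file has two whitespace-separated columns (coverage, number of distinct kmers); the function names and the window size are illustrative choices, not part of jellyfish.

```python
def read_histo(lines):
    """Parse histogram lines of the form '<coverage> <number of distinct kmers>'."""
    histo = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        histo.append((int(parts[0]), int(parts[1])))
    return histo


def local_peaks(histo, window=3):
    """Return (coverage, count) bins that are strict maxima within +/- window bins.

    The first and last bins are skipped, so the usual spike of error kmers
    at coverage 1 is not reported as a peak.
    """
    peaks = []
    for i in range(1, len(histo) - 1):
        coverage, count = histo[i]
        lo = max(0, i - window)
        hi = min(len(histo), i + window + 1)
        neighbours = [histo[j][1] for j in range(lo, hi) if j != i]
        if all(count > c for c in neighbours):
            peaks.append((coverage, count))
    return peaks
```

It could be used as `local_peaks(read_histo(open("jellyfish.histo")))`; a larger window suppresses small bumps caused by noise.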
distribution kmer assembly • 761 views

Be more careful adding images please: the URL you use must point directly to the image. For example, you used: https://ibb.co/gFwTZc where you should have used: https://image.ibb.co/kNrHSx/per_sequence_gc_content.png

Right click on the image in the page (https://ibb.co/gFwTZc) and select Copy Image Address to get the actual image URL.

written 16 months ago by RamRS (22k)

A better option is to click on the embed code tab at the bottom of the page, then copy the full image HTML link and paste it in the post (like below).

per sequence gc content

written 16 months ago by genomax (69k)
h.mon (26k), Brazil, wrote 16 months ago:

How can I extract reads pertaining to the tiny peaks?

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=40 highbindepth=110

I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.

If by "correlate this" you mean the different kmer peaks, run FastQC again after splitting the reads into coverage bins. Then you can check the GC content of each bin.
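
As a quick cross-check alongside FastQC, the per-bin GC content can also be computed directly. This is only a sketch, assuming uncompressed FASTQ files with exactly four lines per record; the function names are made up for illustration.

```python
def gc_fraction(seq):
    """Fraction of G+C among the A/C/G/T bases of a sequence (N and others ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def mean_read_gc(path):
    """Mean per-read GC fraction for all reads in one FASTQ file."""
    with open(path) as handle:
        values = [gc_fraction(seq) for seq in fastq_sequences(handle)]
    return sum(values) / len(values) if values else 0.0
```

Calling `mean_read_gc` on low.fq, mid.fq and high.fq from the command above gives one number per coverage bin to compare.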


Thanks for that. I also want to extract the kmer sequences of low coverage, which have given rise to the tiny peak in the kmer distribution (attached). I want to check if they have a higher GC content.
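
One way to approach this is sketched below, assuming the kmers have first been written out as text with one "KMER COUNT" pair per line (the column format that jellyfish dump can produce). The count range 15 to 30 around the peak at 22 is a hypothetical choice, and the function names are illustrative.

```python
def kmers_in_range(lines, low, high):
    """Yield (kmer, count) pairs whose count lies in [low, high]."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        kmer, count = parts[0], int(parts[1])
        if low <= count <= high:
            yield kmer, count


def mean_kmer_gc(kmers):
    """Mean GC fraction over an iterable of (kmer, count) pairs."""
    fractions = []
    for kmer, _count in kmers:
        kmer = kmer.upper()
        fractions.append((kmer.count("G") + kmer.count("C")) / len(kmer))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hypothetical usage, with dump.txt holding the column-format kmer dump:
# with open("dump.txt") as handle:
#     print(mean_kmer_gc(kmers_in_range(handle, 15, 30)))
```

Running the same calculation on the kmers of the main peak gives a baseline to compare the tiny peak against.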

written 16 months ago by deepti1rao (20)

Would the coverage bins here split reads based on read coverage or on kmer coverage?

written 16 months ago by deepti1rao (20)

kmer coverage over the length of a read - so it correlates highly with read coverage.

http://seqanswers.com/forums/showthread.php?t=49763

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers.
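
The two-pass idea described in that quote can be illustrated with a toy Python sketch. This is not BBNorm's implementation: it keeps all reads in memory, counts kmers exactly, ignores reverse complements and sequencing errors, and the function names, the target value and the seed are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """First pass: count every kmer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_depth(read, counts, k):
    """Median count of a read's kmers, used as its estimated depth."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    return median(counts[kmer] for kmer in kmers) if kmers else 0


def normalize(reads, k=31, target=40, seed=0):
    """Second pass: keep each read with probability target / depth (at most 1)."""
    rng = random.Random(seed)
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        depth = read_depth(read, counts, k)
        keep_prob = 1.0 if depth <= target else target / depth
        if rng.random() < keep_prob:
            kept.append(read)
    return kept
```

Because the depth of a read is the median of its kmer counts, binning reads by this value tracks read coverage closely, which is the point made above.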

written 16 months ago by h.mon (26k)

I just need to know for sure that this tiny peak is genuine data and not erroneous, so that I can include those kmers when I set a kmer coverage cutoff for genome assembly.

written 16 months ago by deepti1rao (20)