Question: Jellyfish-Kmer, Genome Size Estimation
2
gravatar for jackuser1979
5.6 years ago by
jackuser1979860
US
jackuser1979860 wrote:

I have filtered paired-end genomic DNA sequence of non-model plant, sequenced through Hiseq 2000.Now I want to do k-mer analysis and want to estimate genome size. I am have tried to use jellyfish for my analysis. Please let me know the steps and commands to do k-mer and genome size estimation. I just tried with basic command from jellyfish manual

jellyfish count -m 21 -s 100M -t 10 -C reads.fasta

illumina • 15k views
ADD COMMENTlink modified 5.5 years ago • written 5.6 years ago by jackuser1979860
6
gravatar for Joseph Hughes
5.6 years ago by
Joseph Hughes2.7k
Scotland, UK
Joseph Hughes2.7k wrote:

After counting the kmers, the next step is to use the histo command to get the counts for each kmer, use the -f option to also get the counts of 0s: jellyfish count – m 21 –o output –s 100M –t 10 -C reads.fasta

jellyfish histo –f output > histogram.txt

Plot this histogram. Get the number of distinct kmers and the number of distinct error kmers. The error kmers are those between zero and the dip before the hump.

Determine the multiplicity of the second kmer peak (the hump). Divide the total number of good kmers (distinct kmers minus error kmers) by the multiplicity to get an estimate of the total genome size.

ADD COMMENTlink written 5.6 years ago by Joseph Hughes2.7k

Wait, why are we dividing by the multiplicity of the camel hump peak?

ADD REPLYlink written 3.8 years ago by Tom20

Multiplicity of the hump = depth of coverage, so divide to determine 1X depth.

ADD REPLYlink written 3.8 years ago by harold.smith.tarheel4.4k

Wait what? I'm now even more confused.

This whole time, I've been taking the point at which we reach the first minima (dip), and just summing up everything after that. I've been getting numbers that are spot on to the correct genome size of my organism. If I start dividing by like 8 or 13, which is the multiplicity of most of my local maximas, then I will radically throw off my numbers. I've never needed to divide by anything, and the method seems largely the same as that of kmergenie's estimations.

 

Something else must be going on. My reads are of high coverage (30x). yet, my peak is about 13.

 

Can you give my a citation? I've literally never heard of this business with the multiplicity at the hump before. It wouldn't make any sense that the local minima's multiplicity is equal to the coverage because everything after the noise peak is just actual kmers. Adding more reads can't result in any higher numbers because the summation caps out at some assymtote

ADD REPLYlink written 3.8 years ago by Tom20

You must be describing a different value, perhaps the number of unique true kmers. That number will approach a limit with increasing data, as you saturate your sampling. And, if most of the genome is unique, then it's a reasonable approximation of genome size.

On the other hand, counting the frequency of each kmer is ENTIRELY dependent upon the read depth. Think about it: if you double the number of reads, then (on average) you double the number of times you count each true kmer in the genome.

A step-by-step explanation can be found at this link: http://koke.asrc.kanazawa-u.ac.jp/HOWTO/kmer-genomesize.html

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by harold.smith.tarheel4.4k

Strange, this never occurred to me before.I decided to test this experiment by taking a raw read fastq file, and concatenating it with itself in order to simulate twice the number of reads that I have.
 

I proceeded to do a jellyfish analysis "jellyfish count -m 21 -s 100M -t 10 -C file.fastq". And when I proceeded to model the k-mer multiplicity histogram, I got the same exact results. Can you explain what's going on? I was expecting all my bars to move up by double

ADD REPLYlink written 3.8 years ago by Tom20
5
gravatar for Josh Herr
5.6 years ago by
Josh Herr5.7k
University of Nebraska
Josh Herr5.7k wrote:

+1 on both answers from Joseph Hughes and Zev.Kronenberg. Jellyfish is a really robust tool for counting k-mers.

If you're just looking for the estimated genome size, I would use Rayan Chikhi's KmerGenie which -- with a single command -- will output lots of graphs and give you the estimated genome size in the output file.

I was a little disappointed with your question as I found it to be a bit lazy -- with any tool you should always read the documentation (and the relevant literature) to figure out how to use it. We're here to help you, but people will be much more eager to help you when they see you have put some time into a question before asking how to do something.

ADD COMMENTlink written 5.6 years ago by Josh Herr5.7k
4
gravatar for Zev.Kronenberg
5.6 years ago by
United States
Zev.Kronenberg11k wrote:

I think this page has a great explanation :

https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish

ADD COMMENTlink written 5.6 years ago by Zev.Kronenberg11k
0
gravatar for jackuser1979
5.5 years ago by
jackuser1979860
US
jackuser1979860 wrote:

To answer my own question, there is available perl script to get k-mer distribution from jellyfish with nice plot developed by Joesphryan.

ADD COMMENTlink written 5.5 years ago by jackuser1979860
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 934 users visited in the last hour