Question: How do I count the number of nucleotides using jellyfish; can I obtain the number of nucleotides that went into a mer_counts.jf from jellyfish stats?
0
gravatar for Tom
3.0 years ago by
Tom20
United States
Tom20 wrote:

I have a comparison that's been puzzling me. Suppose I have a fastq file for a bacteria: rawread.fastq

I do: cat rawread.fastq | awk 'NR%4==2{print}' | tr -d '\n' | wc -c to count the exact number of nucleotides in the rawread file. I get 121916338


I proceed to do a jellyfish analysis:

jellyfish count -m 21 -s 5000 -t 10 -C rawread.fastq

jellyfish stats mer_count.jf

Output is:

Unique: 553597

Distinct: 5700265

Total: 110466529

Max_count: 1832


Shouldnt the number of distinct kmers multiplied by 21 equal the exact nucleotide count? If I've counted every kmer in the file, I should be able to multiply that by the kmer-size to get how much nucleotide information that was fed into the system. There seems to be a deficit. Whats wrong?

jellyfish genomics kmer k-mer • 1.3k views
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Tom20

If you have 110466529 total k-mers reported for size k=21, then the length of the genome should be n - 21 + 1 = 110466529, and that is n = 110466549. The length of your genome reported by your bash script is 121916338, which differs too much. Does your file contain masked regions? Does jellyfish count these too as part of a k-mer or does it break the mer? Is your bash script counting break lines? (I suppose it is NOT since you have that tr -d \n...) Any of these could be the reason behind the difference. Hope it helps!

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by estebanpw30

His fastQ is of sequenced reads, not a genomic fasta.

ADD REPLYlink written 3.0 years ago by John12k

(These are raw reads not assemblies.)

Think of the problem like this: Suppose you were ONLY given the .jf file. Derive how many nucleotides the original rawread file had.

ADD REPLYlink written 3.0 years ago by Tom20
1
gravatar for Brian Bushnell
3.0 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You math is not taking into account the finite length of reads. Only a complete, unbroken circular genome could contain as many kmers as nucleotides. A N-bp read will contain N-K+1 kmers; so, a 100-bp read will contain 100-21+1=80 kmers, not 100, and only if there are not any no-called bases.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Brian Bushnell16k

I'm quantifying the amount of nucleotide information in my raw read, which seems like it should be straightforward. What I really want to know is whether or not there is a special setting or mathematical algorithm I would need to apply to obtain a number close to 121916338, based purely off of the .jf file that I have

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Tom20
1

That's not possible without knowing the read length also. If you know the read length, the formula would be:

bases = kmers * readlength / (readlength - k + 1)

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Brian Bushnell16k

Yeah no, that's just not working out for me. My read sizes are 251 and calculation is no where close to 121916338

ADD REPLYlink written 3.0 years ago by Tom20

I believe the approximation is: total bases = (total-kmers-found) * readlength / (readlength - k - 1)

If I'm not mistaken. I seem to be getting better results with this metric. I believe the -C option in jellyfish may play a role influencing this calculation

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Tom20
1

121916338 is not factorable by anything looking like the length of a read x the number of reads (only by 2 and itself), therefore you must have trimmed some bases off or have reads of different lengths for some other reason..?

ADD REPLYlink written 3.0 years ago by John12k
1

Good point. Additionally, the presence of Ns in the reads will mess up the calculations.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Brian Bushnell16k

Are Ns factored into jellyfish? Or are they just straight up ignored?

ADD REPLYlink written 3.0 years ago by Tom20

My money is on ignored, because it probably uses 2bit internally, and so there can be no kmers overlapping the Ns.

However, its best to just check with your data :)

ADD REPLYlink written 3.0 years ago by John12k

It could just not be possible to obtain this information doing de novo assembly and multiplying the assembly size by depht

ADD REPLYlink written 3.0 years ago by Tom20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 952 users visited in the last hour