How do I count the number of nucleotides using jellyfish; can I obtain the number of nucleotides that went into a mer_counts.jf from jellyfish stats?
1
0
Entering edit mode
5.5 years ago
Tom ▴ 20

I have a comparison that's been puzzling me. Suppose I have a fastq file for a bacteria: rawread.fastq

I do: cat rawread.fastq | awk 'NR%4==2{print}' | tr -d '\n' | wc -c to count the exact number of nucleotides in the rawread file. I get 121916338

I proceed to do a jellyfish analysis:

jellyfish count -m 21 -s 5000 -t 10 -C rawread.fastq

jellyfish stats mer_count.jf

Output is:

Unique: 553597

Distinct: 5700265

Total: 110466529

Max_count: 1832

Shouldnt the number of distinct kmers multiplied by 21 equal the exact nucleotide count? If I've counted every kmer in the file, I should be able to multiply that by the kmer-size to get how much nucleotide information that was fed into the system. There seems to be a deficit. Whats wrong?

jellyfish genomics k-mer kmer • 2.0k views
0
Entering edit mode

If you have 110466529 total k-mers reported for size k=21, then the length of the genome should be n - 21 + 1 = 110466529, and that is n = 110466549. The length of your genome reported by your bash script is 121916338, which differs too much. Does your file contain masked regions? Does jellyfish count these too as part of a k-mer or does it break the mer? Is your bash script counting break lines? (I suppose it is NOT since you have that tr -d \n...) Any of these could be the reason behind the difference. Hope it helps!

0
Entering edit mode

His fastQ is of sequenced reads, not a genomic fasta.

0
Entering edit mode

(These are raw reads not assemblies.)

Think of the problem like this: Suppose you were ONLY given the .jf file. Derive how many nucleotides the original rawread file had.

1
Entering edit mode
5.5 years ago

You math is not taking into account the finite length of reads. Only a complete, unbroken circular genome could contain as many kmers as nucleotides. A N-bp read will contain N-K+1 kmers; so, a 100-bp read will contain 100-21+1=80 kmers, not 100, and only if there are not any no-called bases.

0
Entering edit mode

I'm quantifying the amount of nucleotide information in my raw read, which seems like it should be straightforward. What I really want to know is whether or not there is a special setting or mathematical algorithm I would need to apply to obtain a number close to 121916338, based purely off of the .jf file that I have

1
Entering edit mode

That's not possible without knowing the read length also. If you know the read length, the formula would be:

0
Entering edit mode

Yeah no, that's just not working out for me. My read sizes are 251 and calculation is no where close to 121916338

0
Entering edit mode

I believe the approximation is: total bases = (total-kmers-found) * readlength / (readlength - k - 1)

If I'm not mistaken. I seem to be getting better results with this metric. I believe the -C option in jellyfish may play a role influencing this calculation

1
Entering edit mode

121916338 is not factorable by anything looking like the length of a read x the number of reads (only by 2 and itself), therefore you must have trimmed some bases off or have reads of different lengths for some other reason..?

1
Entering edit mode

Good point. Additionally, the presence of Ns in the reads will mess up the calculations.

0
Entering edit mode

Are Ns factored into jellyfish? Or are they just straight up ignored?

0
Entering edit mode

My money is on ignored, because it probably uses 2bit internally, and so there can be no kmers overlapping the Ns.

However, its best to just check with your data :)

0
Entering edit mode

It could just not be possible to obtain this information doing de novo assembly and multiplying the assembly size by depht