How do I count the number of nucleotides using jellyfish; can I obtain the number of nucleotides that went into a mer_counts.jf from jellyfish stats?
1
0
Entering edit mode
8.4 years ago
Tom ▴ 20

I have a comparison that's been puzzling me. Suppose I have a fastq file for a bacteria: rawread.fastq

I do: cat rawread.fastq | awk 'NR%4==2{print}' | tr -d '\n' | wc -c to count the exact number of nucleotides in the rawread file. I get 121916338


I proceed to do a jellyfish analysis:

jellyfish count -m 21 -s 5000 -t 10 -C rawread.fastq

jellyfish stats mer_count.jf

Output is:

Unique: 553597

Distinct: 5700265

Total: 110466529

Max_count: 1832


Shouldnt the number of distinct kmers multiplied by 21 equal the exact nucleotide count? If I've counted every kmer in the file, I should be able to multiply that by the kmer-size to get how much nucleotide information that was fed into the system. There seems to be a deficit. Whats wrong?

jellyfish genomics k-mer kmer • 3.4k views
ADD COMMENT
0
Entering edit mode

If you have 110466529 total k-mers reported for size k=21, then the length of the genome should be n - 21 + 1 = 110466529, and that is n = 110466549. The length of your genome reported by your bash script is 121916338, which differs too much. Does your file contain masked regions? Does jellyfish count these too as part of a k-mer or does it break the mer? Is your bash script counting break lines? (I suppose it is NOT since you have that tr -d \n...) Any of these could be the reason behind the difference. Hope it helps!

ADD REPLY
0
Entering edit mode

His fastQ is of sequenced reads, not a genomic fasta.

ADD REPLY
0
Entering edit mode

(These are raw reads not assemblies.)

Think of the problem like this: Suppose you were ONLY given the .jf file. Derive how many nucleotides the original rawread file had.

ADD REPLY
1
Entering edit mode
8.4 years ago

You math is not taking into account the finite length of reads. Only a complete, unbroken circular genome could contain as many kmers as nucleotides. A N-bp read will contain N-K+1 kmers; so, a 100-bp read will contain 100-21+1=80 kmers, not 100, and only if there are not any no-called bases.

ADD COMMENT
0
Entering edit mode

I'm quantifying the amount of nucleotide information in my raw read, which seems like it should be straightforward. What I really want to know is whether or not there is a special setting or mathematical algorithm I would need to apply to obtain a number close to 121916338, based purely off of the .jf file that I have

ADD REPLY
1
Entering edit mode

That's not possible without knowing the read length also. If you know the read length, the formula would be:

bases = kmers * readlength / (readlength - k + 1)

ADD REPLY
0
Entering edit mode

Yeah no, that's just not working out for me. My read sizes are 251 and calculation is no where close to 121916338

ADD REPLY
0
Entering edit mode

I believe the approximation is: total bases = (total-kmers-found) * readlength / (readlength - k - 1)

If I'm not mistaken. I seem to be getting better results with this metric. I believe the -C option in jellyfish may play a role influencing this calculation

ADD REPLY
1
Entering edit mode

121916338 is not factorable by anything looking like the length of a read x the number of reads (only by 2 and itself), therefore you must have trimmed some bases off or have reads of different lengths for some other reason..?

ADD REPLY
1
Entering edit mode

Good point. Additionally, the presence of Ns in the reads will mess up the calculations.

ADD REPLY
0
Entering edit mode

Are Ns factored into jellyfish? Or are they just straight up ignored?

ADD REPLY
0
Entering edit mode

My money is on ignored, because it probably uses 2bit internally, and so there can be no kmers overlapping the Ns.

However, its best to just check with your data :)

ADD REPLY
0
Entering edit mode

It could just not be possible to obtain this information doing de novo assembly and multiplying the assembly size by depht

ADD REPLY

Login before adding your answer.

Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6