Jellyfish -C option explanation
6.3 years ago
pbigbig ▴ 230

Hi,

When working with Jellyfish, I used the sample command from their guide:

    $ jellyfish count -m 22 -s 100M -t 16 -C merged.fq.gz

and I do not really understand the meaning of -C, which is described as "canonical", or (as in Jellyfish 1) "Count both strand, canonical representation". Shouldn't it be the default option (with no need to specify it)? What is the consequence if I do not include -C in my jellyfish command? I also wonder whether this option is already built into kmergenie.

Thank you in advance for your clarification!

6.3 years ago
Rob 5.0k

When counting k-mers in sequencing reads, there is really no way to differentiate between a k-mer and its reverse complement. What I mean by this is that seeing e.g. ACGGT is equivalent to seeing ACCGT, since the latter is the reverse complement of the former and the sequenced reads don't originate from a prescribed strand of the DNA. The '-C' option in jellyfish considers a k-mer and its reverse complement as equivalent, and associates the count for both (the sum of the counts of the k-mer and its reverse complement) with whichever of the two is lexicographically smaller. So, in the example above, only ACCGT would be stored, and its count would be equal to the number of occurrences of both ACCGT and ACGGT.

If you don't include '-C' in your jellyfish options, these k-mers will be treated separately. There's nothing "wrong" with this, per se, but it may not be what you want.

EDIT: I misread Rob's explanation. As he stated, -C collapses each k-mer with its reverse complement.

    >test1
    AAAAATTTTT

    jellyfish count -s 1M -m 5 -o withoutC.jf test.fa && jellyfish dump withoutC.jf
    >1
    AAAAA
    >1
    AAAAT
    >1
    TTTTT
    >1
    AATTT
    >1
    AAATT
    >1
    ATTTT

    jellyfish count -C -s 1M -m 5 -o withC.jf test.fa && jellyfish dump withC.jf
    >2
    AAAAA
    >2
    AAAAT
    >2
    AAATT

That's what I said. '-C' considers both a k-mer and its reverse complement "as equivalent". This means it collapses them.
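For anyone who wants to see the collapsing rule outside of jellyfish, here is a minimal Python sketch of canonical k-mer counting as described above. It is only an illustration, not Jellyfish's actual implementation: the function names are made up for this example, and "lexicographically smaller" is assumed to mean plain string comparison with A < C < G < T.

```python
from collections import Counter

# Translation table for complementing a DNA string (uppercase ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (string order)."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(seq, k, canonical_mode=False):
    """Count all k-mers in seq; optionally collapse each onto its canonical form."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if canonical_mode:
            kmer = canonical(kmer)
        counts[kmer] += 1
    return counts


# Same toy sequence as in the test.fa example above.
plain = count_kmers("AAAAATTTTT", 5)
collapsed = count_kmers("AAAAATTTTT", 5, canonical_mode=True)

print(sorted(plain.items()))      # six distinct 5-mers, each seen once
print(sorted(collapsed.items()))  # three canonical 5-mers, each seen twice
```

On this sequence the sketch gives the same k-mers and counts as the two dump outputs above: six entries with count 1 without collapsing, and AAAAA, AAAAT, AAATT with count 2 when collapsing.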
Yeah, proper reading definitely helps ;) Sorry about that!

No problem ;P

Thanks a lot Rob!

Hi,

I have just experimented with ~30 GB of MiSeq data. The final merged FASTQ file (merged with cat) contains 3 sets of paired-end data (6 FASTQ files in total). When I ran Jellyfish 2 for 22-mers with -C in the count command, I obtained a genome size of ~700Mb at about 27X coverage. However, with the same command minus -C, I obtained a genome size of ~1.4Gb at about 13X coverage.

So which result is true? My data are haploid paired-end reads.

Additionally, I'm not really sure, but kmergenie seems to be designed to treat a k-mer and its reverse complement as a single unit. Does kmergenie therefore always process the data the same way jellyfish does with -C?

Any suggestions and ideas are greatly welcome! Thanks


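The result with -C (~700Mb) is the correct one. Treating reverse-complement k-mers independently would not make sense in a shotgun data set, because you expect fragments from both strands in roughly equal proportion. Without -C, each genomic k-mer is stored as two separate entries (one per strand), each with about half the depth, which is why the estimated genome size doubles and the coverage halves.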
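The doubling can be reproduced with a toy simulation. The sketch below makes several simplifying assumptions that are not from this thread: a random 5 kb "genome", error-free 100 bp reads drawn from either strand with equal probability, and a genome size estimate computed as total k-mers divided by mean k-mer depth (real workflows usually read the depth off the peak of the k-mer histogram). All names are hypothetical.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_reads(genome, n_reads, read_len, rng):
    """Draw error-free reads from random positions, from either strand 50/50."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            read = reverse_complement(read)
        reads.append(read)
    return reads


def count_kmers(reads, k, canonical):
    """Count k-mers over all reads, optionally collapsing reverse complements."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


def estimate(counts):
    """Return (estimated genome size, mean k-mer depth) from a count table."""
    total = sum(counts.values())
    depth = total / len(counts)
    return total / depth, depth


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = simulate_reads(genome, n_reads=1500, read_len=100, rng=rng)

size_c, depth_c = estimate(count_kmers(reads, 15, canonical=True))
size_p, depth_p = estimate(count_kmers(reads, 15, canonical=False))

print(f"canonical:     size ~{size_c:.0f}, depth ~{depth_c:.1f}")
print(f"non-canonical: size ~{size_p:.0f}, depth ~{depth_p:.1f}")
```

In this setup the non-canonical estimate should come out at roughly twice the true size with roughly half the depth, the same pattern as the ~700Mb at 27X versus ~1.4Gb at 13X reported above.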


Thanks a lot