More specifics on Jellyfish -C methodology.
1
0
5.2 years ago
dlawre14 ▴ 30

I know that in Jellyfish, -C stands for canonical kmers, but I'm a little unsure how this is implemented. Does Jellyfish take into account whether the reads are paired-end or not? I'm working on my own kmer software to use internally and want the results to be equivalent to what Jellyfish would output.

So far, my understanding is that -C does not take into account which strand a read came from. Instead, it automatically creates the reverse complement of any kmer it sees and then treats a kmer and its reverse complement as the same kmer.

jellyfish kmer • 1.3k views
2
5.2 years ago
Rob 5.2k

There is no special accounting for paired-end reads in Jellyfish (or in any kmer counting software of which I'm aware). The -C option just means that when Jellyfish looks at a kmer k, it considers both k and rc(k), and associates both with whichever of the two is alphabetically (lexicographically) smaller. For example, if k is the smaller of the two, the output table will contain only k, and its count will include occurrences of both k and rc(k); if rc(k) is the smaller of the two, the output table will contain only rc(k), and its count will include occurrences of both rc(k) and k. This also means that no special rules are applied for stranded protocols. Jellyfish processes each kmer independently, and exactly which kmers it considers depends on the options you pass, such as -C.
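If it helps while writing your own counter, here is a minimal Python sketch of the behaviour described above. It is not Jellyfish's actual implementation; the function names are made up for illustration, and skipping kmers that contain anything other than A, C, G, T is an assumption on my part that you should check against Jellyfish's real output.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return whichever of kmer and rc(kmer) sorts first lexicographically."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical(reads, k):
    """Count canonical kmers over all reads, ignoring strand and pairing."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: skip kmers containing non-ACGT characters (e.g. N).
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts
```

Each read is handled on its own, so paired-end mates are simply two more reads in the input list. A kmer that is its own reverse complement is counted once per occurrence.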

0

OK, I think I understand most of this, but let me give a specific example. Suppose ATG occurs 3 times and CAT occurs 2 times. What does Jellyfish output for this: CAT> 5 or ATG> 5?

0

According to the Jellyfish manual (https://github.com/gmarcais/Jellyfish/blob/master/doc/jellyfish.pdf), the canonical representative is "whichever comes first lexicographically". CAT is the reverse complement of ATG, and ATG sorts before CAT, so in your case it would be ATG> 5.
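To check the same arithmetic against your own tool, one option is to take a strand-specific count table and fold it into canonical form. This is only a sketch under that assumption; `merge_to_canonical` is a hypothetical helper name, not something Jellyfish provides.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def merge_to_canonical(counts):
    """Fold strand-specific kmer counts into canonical kmer counts."""
    merged = {}
    for kmer, n in counts.items():
        rc = reverse_complement(kmer)
        # The lexicographically smaller of the pair names the group.
        key = kmer if kmer <= rc else rc
        merged[key] = merged.get(key, 0) + n
    return merged


# ATG seen 3 times, CAT (its reverse complement) seen 2 times.
print(merge_to_canonical({"ATG": 3, "CAT": 2}))  # {'ATG': 5}
```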

1

OK, so it's actually simple: they count both, group the kmer together with rc(kmer), and select the lexicographically first of the two as the "name" for that set. Thanks for all the help!