In Jellyfish, what does the "-C" switch do?
2
1
Entering edit mode
5.9 years ago
Tom ▴ 20

In plain English, with a background explanation, can someone explain what that -C command does? It says something about canonical, but I'm just finding this to be too ambiguous. What does that even mean? What does "both strands" mean? Is it doing a kmer analysis on a generated compliment strand from the raw reads??

Jellyfish Kmer Assembly • 1.6k views
2
Entering edit mode
5.9 years ago
John 13k

Straight from man jellyfish (one of the lesser-known Marvel superheros):

ORIENTATION
When the orientation of the sequences in the input fasta file is not known, e.g. in sequencing reads, using
--both-strands (-C) makes the most sense.

For any k-mer m, its canonical representation is m itself or its reverse-complement, whichever comes first
lexicographically. With the option -C, only the canonical representation of the mers are stored in the hash
and the count value is the number of occurrences of both the mer and its reverse-complement.

This flag always causes so much confusion, because many biologists like myself do not know what "canonical" in this context means. I would have thought canonical meant mapped to the forward strand or something - the 'canon' DNA sequence - but no. For example, 'TAGGGACT', although it exists in the human genome, is non-canonical (using the Jellyfish definition), because it's lexographically after 'AGTCCCTA', and thus it will get rev.comp'd before adding it to Jellyfish's hashtable. I think it was mainly thought of to save RAM, as jelly only needs to store 1/2 as many DNA fragments when the k-mer size is small, and not anything to do with 'fixing' reverse complimented reads from sequencing. If you wanted to solve that specific problem, you'd map your reads then have jelly parse the BAM file, using the flag for reverse strand to rev. comp. when needed.

1
Entering edit mode

Perhaps from a biological perspective "Canonical" is misleading, but it's definitely the right terminology from a math / computer-science perspective. Thanks for sharing your perspective, though - I was not aware that it caused any confusion.

0
Entering edit mode

Ah is that where it comes from! :) Sorry, I'm a comic-book nerd who was never very good at maths, so maybe canonical was only non-obvious to me.

But the --both-strands flag makes it very clear, plus the man page I think makes it clear too. It's only the jellyfish count --help I think which is confusing, because the description of -C is:

-C, --canonical Count both strand, canonical representation (false)

Maybe something like:

Rev.comp. mer if result is alphabetically sooner. Better reflects strand ambiguity.

Its a tough one to cram into < 100 char though! Maybe just put 'see man page', haha :)

0
Entering edit mode

Wouldn't not counting a k-mer fragment that has a matching reverse compliment already in the jellyfish hash table for k-mers mean that you would inadvertently skip sequences? What if a section of the genome "just so happens" to have a sequence that matches a reverse compliment of a k-mer already in the hash table?

0
Entering edit mode

Yes, there are lots of regions in the genome which are just TTTTTTTTTTTTTTTTTT for example, which will always get turned to AAA.., but the idea was that in situations where you don't know the strand (single end sequencing which maps to both strands just as well - or reads which haven't/can't be mapped to a genome for some reason) it makes no-sense to treat AAA- any differently from TTT-, so combining them into one makes sense.

But again, in situations where you could map and then discern the strand, it would make more sense to rev. comp. what you know to be on the reverse strand and drop the few reads you are unsure of. But not everyone has this luxury...

0
Entering edit mode
5.9 years ago

Reads from genomic libraries will be from either strand of the DNA. If you don't include the -C flag, then k-mers from each strand will be counted separately, even though they should really be counted together.

For example, if you have a genomic segment represented by these two complementary strands:

AGTCCCTA
TCAGGGAT


Let's say you have 20 reads with k-mers that corresponds to the AGTCCCTA strand and 30 reads that corresponds to the TCAGGGAT strand. If you don't use -C, then jellyfish will count the two strands separately and output 20 and 30 counts for the two k-mers. If you use the -C flag, then jellyfish will count 50 total for this "canonical" k-mer representing both strands.