Can someone explain in plain English, with some background, what the -C option does? The documentation says something about "canonical", but I find that too ambiguous. What does it even mean? What does "both strands" mean? Is it doing a k-mer analysis on a generated complement strand from the raw reads?
Straight from "man jellyfish" (one of the lesser-known Marvel superheroes):
When the orientation of the sequences in the input fasta file is not known, e.g. in sequencing reads, using
--both-strands (-C) makes the most sense.
For any k-mer m, its canonical representation is m itself or its reverse-complement, whichever comes first
lexicographically. With the option -C, only the canonical representation of the mers are stored in the hash
and the count value is the number of occurrences of both the mer and its reverse-complement.
This flag always causes a lot of confusion, because many biologists, myself included, do not know what "canonical" means in this context. I would have guessed that canonical meant mapped to the forward strand or something like that (the 'canon' DNA sequence), but no. For example, 'TAGGGACT', although it exists in the human genome, is non-canonical by the Jellyfish definition, because it comes lexicographically after its reverse complement 'AGTCCCTA', so it gets reverse-complemented before being added to Jellyfish's hash table. I think the option was mainly intended to save RAM, since Jellyfish only needs to store half as many distinct k-mers when the k-mer size is small, rather than to 'fix' reverse-complemented reads from sequencing. If you wanted to solve that specific problem, you would map your reads and then have Jellyfish parse the BAM file, using the reverse-strand flag to reverse-complement where needed.
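To make the definition concrete, here is a minimal Python sketch of the "canonical representation" rule quoted from the man page. The function names are hypothetical and this is not Jellyfish's actual implementation; it assumes uppercase A/C/G/T only and plain string comparison (so A < C < G < T).

```python
# Illustrative sketch only: hypothetical helper names, not Jellyfish code.
# Assumes uppercase A/C/G/T input with no ambiguity codes such as N.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the k-mer or its reverse complement, whichever sorts first."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("TAGGGACT"))  # AGTCCCTA
print(canonical("TAGGGACT"))           # AGTCCCTA
print(canonical("AGTCCCTA"))           # AGTCCCTA
```

Both orientations collapse to the same key, which is why only one of them needs to be stored.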
Reads from genomic libraries will be from either strand of the DNA. If you don't include the -C flag, then k-mers from each strand will be counted separately, even though they should really be counted together.
For example, if you have a genomic segment represented by these two complementary strands:
5'-AGTCCCTA-3'
3'-TCAGGGAT-5'
Let's say you have 20 reads with k-mers that correspond to the AGTCCCTA strand and 30 reads that correspond to the TCAGGGAT strand (which reads TAGGGACT in the 5' to 3' direction). If you don't use -C, then Jellyfish will count the two strands separately and output counts of 20 and 30 for the two k-mers. If you use the -C flag, then Jellyfish will report a total of 50 for the single "canonical" k-mer representing both strands.
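The 20 / 30 / 50 arithmetic above can be reproduced with a small Python sketch. The reads are made-up toy data mirroring the example, `count_kmers` is a hypothetical helper, and the `canonical` argument only imitates what -C is described as doing; none of this is Jellyfish's own code.

```python
from collections import Counter

# Illustrative sketch only: toy data and hypothetical helper names.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k, canonical=False):
    """Count every k-mer in the reads; optionally merge each k-mer
    with its reverse complement under the lexicographically smaller one."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


# 20 reads from one strand, 30 from the other (as sequenced, 5' to 3').
reads = ["AGTCCCTA"] * 20 + ["TAGGGACT"] * 30

separate = count_kmers(reads, k=8)
merged = count_kmers(reads, k=8, canonical=True)

print(separate["AGTCCCTA"], separate["TAGGGACT"])  # 20 30
print(merged["AGTCCCTA"])                          # 50
```

Without the canonical step the two orientations get separate entries; with it, there is a single entry holding the combined count.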