Question: In Jellyfish, what does the "-C" switch do?
1
gravatar for Tom
3.3 years ago by
Tom20
United States
Tom20 wrote:

In plain English, with a background explanation, can someone explain what that -C command does? It says something about canonical, but I'm just finding this to be too ambiguous. What does that even mean? What does "both strands" mean? Is it doing a kmer analysis on a generated compliment strand from the raw reads??

jellyfish kmer assembly • 1.0k views
ADD COMMENTlink modified 2.2 years ago by Biostar ♦♦ 20 • written 3.3 years ago by Tom20
2
gravatar for John
3.3 years ago by
John12k
Germany
John12k wrote:

Straight from "man jellyfish" (one of the lesser-known Marvel superheros):

 

   ORIENTATION
       When the orientation of the sequences in the input fasta file is not known, e.g. in  sequencing  reads,  using
       --both-strands (-C) makes the most sense.

       For  any  k-mer  m,  its canonical representation is m itself or its reverse-complement, whichever comes first
       lexicographically. With the option -C, only the canonical representation of the mers are stored  in  the  hash
       and the count value is the number of occurrences of both the mer and its reverse-complement.

 

This flag always causes so much confusion, because many biologists like myself do not know what "canonical" in this context means. I would have thought canonical meant mapped to the forward strand or something -  the 'canon' DNA sequence - but no. For example, 'TAGGGACT', although it exists in the human genome, is non-canonical (using the Jellyfish definition), because it's lexographically after 'AGTCCCTA', and thus it will get rev.comp'd before adding it to Jellyfish's hashtable. I think it was mainly thought of to save RAM, as jelly only needs to store 1/2 as many DNA fragments when the k-mer size is small, and not anything to do with 'fixing' reverse complimented reads from sequencing. If you wanted to solve that specific problem, you'd map your reads then have jelly parse the BAM file, using the flag for reverse strand to rev. comp. when needed.

ADD COMMENTlink modified 3.2 years ago • written 3.3 years ago by John12k
1

Perhaps from a biological perspective "Canonical" is misleading, but it's definitely the right terminology from a math / computer-science perspective.  Thanks for sharing your perspective, though - I was not aware that it caused any confusion.

ADD REPLYlink written 3.3 years ago by Brian Bushnell16k

Ah is that where it comes from! :) Sorry, i'm a comic-book nerd who was never very good at maths, so maybe canonical was only non-obvious to me.
But the --both-strands flag makes it very clear, plus the man page I think makes it clear too. It's only the "jellyfish count --help" I think which is confusing, because the description of -C is:

 -C, --canonical      Count both strand, canonical representation (false)

Maybe something like:
"Rev.comp. mer if result is alphabetically sooner. Better reflects strand ambiguity."

Its a tough one to cram into < 100 char though! Maybe just put 'see man page', haha :)

ADD REPLYlink modified 3.2 years ago • written 3.3 years ago by John12k

Wouldn't not counting a k-mer fragment that has a matching reverse compliment already in the jellyfish hash table for k-mers mean that you would inadvertently skip sequences? What if a section of the genome "just so happens" to have a sequence that matches a reverse compliment of a k-mer already in the hash table?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Tom20

Yes, there are lots of regions in the genome which are just "TTTTTTTTTTTTTTTTTT" for example, which will always get turned to AAA.., but the idea was that in situations where you don't know the strand (single end sequencing which maps to both strands just as well - or reads which haven't/can't be mapped to a genome for some reason) it makes no-sense to treat AAA- any differently from TTT-, so combining them into one makes sense.

But again, in situations where you could map and then discern the strand, it would make more sense to rev. comp. what you know to be on the reverse strand and drop the few reads you are unsure of. But not everyone has this luxury...

ADD REPLYlink written 3.2 years ago by John12k
0
gravatar for Damian Kao
3.3 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Reads from genomic libraries will be from either strand of the DNA. If you don't include the -C flag, then k-mers from each strand will be counted separately, even though they should really be counted together. 

For example, if you have a genomic segment represented by these two complementary strands:

AGTCCCTA
TCAGGGAT

Let's say you have 20 reads with k-mers that corresponds to the AGTCCCTA strand and 30 reads that corresponds to the TCAGGGAT strand. If you don't use -C, then jellyfish will count the two strands separately and output 20 and 30 counts for the two k-mers. If you use the -C flag, then jellyfish will count 50 total for this "canonical" k-mer representing both strands.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Damian Kao15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 818 users visited in the last hour