Question: Jellyfish -C option explanation
pbigbig wrote, 3.6 years ago:

Hi,

When working with Jellyfish, I used the sample command from their guide:

$ jellyfish count -m 22 -s 100M -t 16 -C merged.fq.gz

and I do not really understand the meaning of -C, which is described as "canonical", or (as in Jellyfish 1) "Count both strand, canonical representation". Shouldn't it be the default (with no need to specify it)? What is the consequence if I do not include -C in my jellyfish command?

I also wonder whether this behaviour is already built into kmergenie?

Thank you in advance for your clarification!

Rob wrote, 3.6 years ago:

When counting k-mers in sequencing reads, there is really no way to distinguish a k-mer from its reverse complement. What I mean is that seeing e.g. ACGGT is equivalent to seeing ACCGT, since the latter is the reverse complement of the former and the sequenced reads don't originate from a prescribed strand of the DNA. The '-C' option in jellyfish treats a k-mer and its reverse complement as equivalent, and associates the combined count (the sum of the counts of the k-mer and its reverse complement) with whichever of the two is lexicographically smaller. So, in the example above, only ACCGT would be stored, and its count would equal the number of occurrences of both ACCGT and ACGGT. If you don't include '-C' in your jellyfish options, these k-mers will be counted separately. There's nothing "wrong" with this per se, but it may not be what you want.
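To make the collapsing concrete, here is a minimal Python sketch of canonical k-mer counting. It only illustrates the idea described above and is not how Jellyfish is implemented internally; the function names (`reverse_complement`, `canonical`, `count_kmers`) are made up for this example, and it assumes the input contains only A, C, G and T.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(seq: str, k: int, canonical_form: bool = False) -> Counter:
    """Count all k-mers in seq; with canonical_form=True, mimic the idea behind -C."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[canonical(kmer) if canonical_form else kmer] += 1
    return counts


if __name__ == "__main__":
    print(reverse_complement("ACGGT"))
    print(sorted(count_kmers("AAAAATTTTT", 5).items()))
    print(sorted(count_kmers("AAAAATTTTT", 5, canonical_form=True).items()))
```

With `canonical_form=False` every k-mer is counted exactly as it appears in the sequence; with `canonical_form=True` each k-mer is first replaced by the smaller of itself and its reverse complement, so the two orientations share one entry and one count.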


EDIT: I misread Rob's explanation. As he stated, -C collapses each k-mer with its reverse complement.

$ cat test.fa
>test1
AAAAATTTTT
$ jellyfish count -s 1M -m 5 -o withoutC.jf test.fa && jellyfish dump withoutC.jf
>1
AAAAA
>1
AAAAT
>1
TTTTT
>1
AATTT
>1
AAATT
>1
ATTTT
$ jellyfish count -C -s 1M -m 5 -o withC.jf test.fa && jellyfish dump withC.jf
>2
AAAAA
>2
AAAAT
>2
AAATT
Reply by thackl, written 3.6 years ago (modified 3.6 years ago)

That's what I said. '-C' "considers both a k-mer and its reverse complement as equivalent". This means it collapses them.

Reply by Rob, written 3.6 years ago

Yeah, proper reading definitely helps ;) Sorry about that!

Reply by thackl, written 3.6 years ago

No problem ;P

Reply by Rob, written 3.5 years ago

Thanks a lot Rob!

Reply by pbigbig, written 3.6 years ago

Hi,

I have just experimented with ~30 GB of MiSeq data. The final merged FASTQ file (concatenated with cat) contains 3 sets of paired-end data (6 FASTQ files in total). When I ran Jellyfish 2 for 22-mers with -C in the count command, I obtained a genome size of ~700Mb and a coverage of about 27X. However, when I used the same command without -C, I obtained a genome size of ~1.4Gb and a coverage of about 13X.

So which result is correct? My data are haploid paired-end reads.

Additionally, I'm not really sure, but kmergenie seems to treat a k-mer and its reverse complement as a single unit by default. Does kmergenie therefore always behave the same way as jellyfish with -C?

Any suggestions and ideas are greatly welcome! Thanks

 

Reply by pbigbig, written 3.5 years ago

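The result with -C (~700Mb) is the correct one. Treating reverse-complement k-mers independently would not make sense for a shotgun data set, because you expect roughly half of the fragments to come from each strand. Without -C, the occurrences of each genomic k-mer are split between its two orientations, so you see roughly twice as many distinct k-mers at about half the coverage, which matches your ~1.4Gb at 13X versus ~700Mb at 27X.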
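A small simulation can illustrate why the numbers double and halve. The sketch below is a toy model, not the poster's data or pipeline: the genome length, read length, read count, k and seed are arbitrary assumptions, and "distinct k-mers" stands in only loosely for a genome size estimate. Reads are drawn from a random genome with a 50/50 chance of coming from either strand, then counted with and without collapsing reverse complements.

```python
import random
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_reads(genome, n_reads, read_len, rng):
    """Draw error-free reads uniformly; each read comes from either strand with p = 0.5."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            read = reverse_complement(read)
        reads.append(read)
    return reads


def count_kmers(reads, k, canonical_form):
    """Count k-mers over all reads, optionally collapsing reverse complements."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical_form:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


def summarize(counts):
    """Return (number of distinct k-mers, mean count per distinct k-mer)."""
    distinct = len(counts)
    return distinct, sum(counts.values()) / distinct


def run_demo(seed=0):
    """Toy parameters (all assumptions): 2 kb random genome, 2000 reads of 50 bp, k = 15."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    reads = simulate_reads(genome, n_reads=2000, read_len=50, rng=rng)
    with_c = summarize(count_kmers(reads, 15, canonical_form=True))
    without_c = summarize(count_kmers(reads, 15, canonical_form=False))
    return with_c, without_c


if __name__ == "__main__":
    with_c, without_c = run_demo()
    print("collapsed (like -C):  distinct=%d mean count=%.1f" % with_c)
    print("strand-specific:      distinct=%d mean count=%.1f" % without_c)
```

In this toy setting the strand-specific count should show close to twice as many distinct k-mers and about half the mean count of the collapsed run, which is the same pattern as the ~1.4Gb/13X versus ~700Mb/27X results reported above.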

Reply by thackl, written 3.5 years ago

Thanks a lot

Reply by pbigbig, written 3.5 years ago