Jellyfish -C option explanation
6.3 years ago
pbigbig ▴ 230

Hi,

When working with Jellyfish, I used the sample command from their guide:

$ jellyfish count -m 22 -s 100M -t 16 -C merged.fq.gz

and I do not really understand the meaning of -C, which is described as "canonical", or (as in Jellyfish 1) "Count both strand, canonical representation". Shouldn't it be the default option (with no need to specify it)? What is the consequence if I do not include -C in my jellyfish command?

I also wonder whether this behaviour is already built into kmergenie?

Thank you in advance for your clarification!

6.3 years ago
Rob 5.0k

When counting k-mers in sequencing reads, there is really no way to differentiate between a k-mer and its reverse complement. What I mean by this is that seeing e.g. ACGGT is equivalent to seeing ACCGT, since the latter is the reverse complement of the former and the sequenced reads don't originate from a prescribed strand of the DNA. The '-C' option in jellyfish treats a k-mer and its reverse complement as equivalent, and associates the combined count (the sum of the counts of the k-mer and of its reverse complement) with whichever of the two is lexicographically smaller. So, in the example above, only ACCGT would be stored, and its count would equal the number of occurrences of ACCGT plus the number of occurrences of ACGGT. If you don't include '-C' in your jellyfish options, these two k-mers will be counted separately. There's nothing "wrong" with this, per se, but it may not be what you want.
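To make the collapsing rule concrete, here is a minimal Python sketch of canonical k-mer counting as described above. It is only an illustration of the idea, not Jellyfish's actual implementation; the function names (`reverse_complement`, `canonical`, `count_kmers`) are made up for this example, and it assumes the sequence contains only the letters A, C, G and T.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(seq: str, k: int, canonical_mode: bool = False) -> Counter:
    """Count all k-mers in seq; if canonical_mode is set, collapse each
    k-mer with its reverse complement (the behaviour described for -C)."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[canonical(kmer) if canonical_mode else kmer] += 1
    return counts


if __name__ == "__main__":
    print(reverse_complement("ACGGT"))  # ACCGT
    print(canonical("ACGGT"))           # ACCGT
```

With `canonical_mode=False`, ACGGT and ACCGT are separate keys; with `canonical_mode=True`, both are counted under ACCGT.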


EDIT: I misread Rob's explanation. As he stated, -C collapses each k-mer with its reverse complement.

$ cat test.fa
>test1
AAAAATTTTT
$ jellyfish count -s 1M -m 5 -o withoutC.jf test.fa && jellyfish dump withoutC.jf
>1
AAAAA
>1
AAAAT
>1
TTTTT
>1
AATTT
>1
AAATT
>1
ATTTT
$ jellyfish count -C -s 1M -m 5 -o withC.jf test.fa && jellyfish dump withC.jf
>2
AAAAA
>2
AAAAT
>2
AAATT

That's what I said. '-C' "considers both a k-mer and its reverse complement as equivalent". This means it collapses them.


Yeah, proper reading definitely helps ;) - Sorry about that!


No problem ;P


Thanks a lot Rob!


Hi,

I have just experimented with ~30 GB of MiSeq data. The final merged FASTQ file (merged with $ cat) contains 3 sets of paired-end data (6 FASTQ files in total). Running Jellyfish 2 on 22-mers with -C in the count command, I obtained a genome size of ~700 Mb and a coverage of about 27X. However, using the same command without -C, I obtained a genome size of ~1.4 Gb and a coverage of about 13X.

So which result is correct? My data are haploid paired-end reads.

Additionally, I'm not really sure, but kmergenie seems to be designed to treat a k-mer and its reverse complement as a single unit. So does kmergenie always process the data the same way Jellyfish does with -C?

Any suggestions and ideas are greatly welcome! Thanks

 


The result with -C (~700 Mb) is the correct one. Treating reverse-complement k-mers independently does not make sense for a shotgun data set, because you expect fragments to come from both strands in roughly equal (50/50) proportions.
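A quick back-of-the-envelope check shows why the two runs differ by a factor of about two. The sketch below assumes the common rough estimator "genome size ≈ total k-mers counted / peak k-mer depth"; the thread does not say how the reported sizes were actually computed, so this formula and the helper name `estimate_genome_size` are assumptions. The input figures are the rounded values quoted above.

```python
def estimate_genome_size(total_kmers: float, peak_depth: float) -> float:
    """Rough estimator (assumed here): total k-mer occurrences divided by
    the k-mer depth at the main peak of the k-mer histogram."""
    return total_kmers / peak_depth


# Back out the total k-mer occurrences implied by each reported result.
total_with_c = 700e6 * 27      # ~700 Mb at ~27X, with -C
total_without_c = 1.4e9 * 13   # ~1.4 Gb at ~13X, without -C

# Both runs imply roughly the same number of k-mer occurrences,
# as expected since they counted the same reads.
print(total_with_c, total_without_c)

# Without -C, each k-mer and its reverse complement are separate entries,
# so the per-entry depth is about halved and the estimate about doubles.
print(estimate_genome_size(total_with_c, 27))      # with -C
print(estimate_genome_size(total_with_c, 27 / 2))  # depth halved, no -C
```

Under this assumption, the number of k-mer occurrences is unchanged between the runs; only the way they are split between the two strands changes, which halves the apparent depth and doubles the apparent genome size when -C is omitted.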


Thanks a lot
