Question: jellyfish kmer counting discrepancy
0
gravatar for 7ehuang
10 days ago by
7ehuang0
7ehuang0 wrote:

Hi all, I am using jellyfish to count kmers of length 31 that appear in a viral genome sequence. These were the commands I ran:

jellyfish count -m 31 -s 154675 -t 10 NC_001798.fa  # the -s parameter is based on the size of the NC_001798 genome 
jellyfish dump mer_counts.jf > mer_counts_dumps.fa

I then take a random 31mer from mer_counts_dumps.fa (e.g. GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG, which is output to have a count of 13) and grep for that same 31mer in the original input NC_001798.fa file. This is the command I run (to account for 31mers that might go across a line break):

cat NC_001798.fa | tr -d " \t\n\r" | grep -o GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG | wc -l

However, this only returns 5, which tells me that the 31mer does not appear 13 times in the fasta file (only 5 times). Does anyone know what may be causing the discrepancy? I also tried using kmercountexact.sh from the BBMap suite and it also outputs a count of 13 for this specific 31mer, so I'm wondering if my method of grepping for the 31mer in the fasta file is erroneous. I have this problem for multiple 31mers with a count greater than 1 in mer_counts_dumps.fa.

Thanks!

Best, Elaine

ADD COMMENTlink modified 9 days ago by Istvan Albert ♦♦ 81k • written 10 days ago by 7ehuang0
2
gravatar for Istvan Albert
9 days ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

The better tool to verify patterns that will include overlapping matches is the dreg tool from emboss

cat NC_001798.fa | dreg -filter -pattern GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG

produces 13 matches

#=======================================
#
# Sequence: NC_001798.2     from: 1   to: 154675
# HitCount: 13
#
# Pattern: GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
#
#=======================================

  Start     End  Strand Pattern                               Sequence
 153769  153799       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153784  153814       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153799  153829       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153814  153844       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153829  153859       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153844  153874       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153859  153889       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153874  153904       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153889  153919       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153904  153934       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153919  153949       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153934  153964       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153949  153979       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 154675
# Reported_sequences: 1
# Reported_hitcount: 13
#---------------------------------------
ADD COMMENTlink written 9 days ago by Istvan Albert ♦♦ 81k

Oh, this is great! Exactly the tool I need. Thank you very much.

ADD REPLYlink written 9 days ago by 7ehuang0
0
gravatar for swbarnes2
10 days ago by
swbarnes26.5k
United States
swbarnes26.5k wrote:

From the manual:

In sequencing reads, it is unknown which strands of the DNA is sequenced. As a consequence, a k-mer or its reverse complement are essentially equivalent.

did you count up the rev-comp of your test sequence too?

Also, I'm not totally sure if grep will find every instance of the k-mer if they overlap, I'm guessing the jellyfish authors did think of that.

ADD COMMENTlink written 10 days ago by swbarnes26.5k

Does anyone know what may be causing the discrepancy?

Reverse complement is the likely explanation. Instead of reverse-complementing your genome file, you can do it to the 31-mer:

cat NC_001798.fa | tr -d " \t\n\r" | grep -o CGCCCGACCCCCGCCCGCCCGACCCCCGCCC | wc -l

grep counts overlapping patterns as one, but it is unlikely that k-mers of this length overlap.

ADD REPLYlink written 10 days ago by Mensur Dlakic1.3k

Thank you both for your responses!

Since I am counting kmers in a genome, the reverse-complement of a kmer shouldn't be considered equivalent to it, so when I ran kmercountexact.sh, I used the rcomp=f parameter. Not considering a kmer and its reverse-complement equivalent is the default behavior for jellyfish (you would need to use the -C parameter to set that behavior). This information is from section 1.1.2 of the manual (about counting kmers in a genome).

Even when I grep for the reverse complement, I get a count of 5. So adding the count for the kmer and the reverse complement would give me 10, which is still different from the count of 13 output by jellyfish.

ADD REPLYlink written 10 days ago by 7ehuang0
2

I stand corrected that an overlap is unlikely. That statement was based only on the k-mer length (31), while a simple visual inspection shows that you have a perfect duplication of GGGCGGGGGTCGGGC in your 31-mer (plus an extra G). It seems like your random 31-mer is not so random. I suggest you test a different 31-mer with your grep command, or try counting this repetitive 15-mer.

If you do your grep command on this sequence (14 repeats of the 15-mer plus an extra G at the end):

GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG

the count of your 31-mer will be 5 even though the actual count is 7.

ADD REPLYlink written 9 days ago by Mensur Dlakic1.3k

Now that you mention it, I see that the 31mers I'm running into this problem for all have duplications of some kmer within them. Thank you very much!

ADD REPLYlink written 9 days ago by 7ehuang0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 858 users visited in the last hour