I'm trying to identify unique kmers across a large collection of bacterial genomes, with the goal of building primer sets specific to each genome. To that end I've run Tallymer to generate a list of 22mers for each genome, and I'm working on a script to identify the 22mers that are unique across my whole collection. I'm currently looking at the tallymer search output, which looks like this:
200 +712037 1 cgctgctgatcatcgccgagga
200 +712128 1 caaggctccgggcttcggtgac
200 +712129 1 aaggctccgggcttcggtgacc
200 +712130 1 aggctccgggcttcggtgaccg
201 +34857 2 ctcaaccttggcaaggttgcgc
201 +34858 2 tcaaccttggcaaggttgcgct
201 +70058 1 gttgtaggcgcgggccagacgg
201 +70059 1 ttgtaggcgcgggccagacggg
201 +70060 1 tgtaggcgcgggccagacgggt
202 +586 5 aactgtctcacgacgttctgaa
col_1: the ordinal number of the contig as it appears in the genome fasta
col_2: the position of the kmer in the current contig
col_3: the occurrence count for the kmer in the index
col_4: the kmer sequence
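A minimal sketch of how one of these lines could be parsed, assuming the four columns are whitespace-separated exactly as shown above. The `Hit` type and `parse_line` function are hypothetical names for illustration, and treating the leading `+` on column 2 as a strand marker is an assumption, not something confirmed by the Tallymer documentation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: int   # col_1: ordinal number of the contig in the fasta
    strand: str   # leading sign on col_2 (ASSUMED to be a strand marker)
    pos: int      # col_2: position of the kmer in the contig
    count: int    # col_3: occurrence count reported by tallymer
    kmer: str     # col_4: the kmer sequence


def parse_line(line: str) -> Hit:
    """Split one output line into its four columns."""
    contig, pos, count, kmer = line.split()
    # Keep the sign separately so the position can be stored as an integer.
    strand = pos[0] if pos[0] in "+-" else "+"
    return Hit(int(contig), strand, int(pos.lstrip("+-")), int(count), kmer)


if __name__ == "__main__":
    print(parse_line("201 +70060 1 tgtaggcgcgggccagacgggt"))
```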
The Tallymer documentation is a bit sparse, and I am confused about what the occurrence count (column 3) is actually telling me. From what I can see, it is NOT the number of times the kmer appears in my entire multi-record fasta file. For example, this kmer:
201 +70060 1 tgtaggcgcgggccagacgggt
occurs in my output 12 times, each time on a different contig. This led me to believe that the occurrence count is the number of times a kmer appears in that single sequence record (contig #201 in this case). But then I see odd output like this next example. This output line:
202 +592 5 ctcacgacgttctgaacccagc
would lead me to believe that the kmer appears 5 times in contig #202. But when I grep the full report, I see only a single line for that kmer on contig #202, with a single position. Also mysterious: that exact kmer appears on 994 lines of the output, and in all 994 cases it's reported as present exactly 5 times. No other occurrence count is reported for any of the 994 instances of this kmer.
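The grep check described above can be done for every kmer at once. This is a rough sketch, assuming the same four whitespace-separated columns; `summarize` and the `report.txt` filename are hypothetical. For each kmer it records how many output lines mention it, on how many distinct contigs, and which values column 3 takes.

```python
from collections import defaultdict


def summarize(lines):
    """Per kmer: number of output lines, distinct contigs, distinct col_3 values."""
    stats = defaultdict(lambda: {"lines": 0, "contigs": set(), "counts": set()})
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        contig, _pos, count, kmer = fields
        entry = stats[kmer]
        entry["lines"] += 1
        entry["contigs"].add(int(contig))
        entry["counts"].add(int(count))
    return dict(stats)


if __name__ == "__main__":
    # "report.txt" is a placeholder for the saved tallymer search output.
    with open("report.txt") as handle:
        for kmer, entry in summarize(handle).items():
            print(kmer, entry["lines"], len(entry["contigs"]), sorted(entry["counts"]))
```

Comparing the line total for a kmer against its reported column 3 values makes it easy to see whether the two ever agree, without grepping one kmer at a time.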
So I am at a loss. Can someone familiar with Tallymer explain how I should be interpreting this output? These are bacterial genomes, and there are on the order of 2-3 million 22mers reported per genome. The program completed without error, and I am using the latest release of Tallymer (which is part of the genometools package, I have genometools v1.6.2).