Question

Interpreting Tallymer search output

2.0 years ago

jmartin • 0

I'm trying to identify unique kmers across a large collection of bacterial genomes, with the goal of building primer sets specific to each genome. To that end I've run Tallymer to generate a list of 22mers for each genome, and I'm working on a script to identify the 22mers that are unique across my whole collection. I'm looking at the tallymer search output right now, which looks like this:

200 +712037 1   cgctgctgatcatcgccgagga
200 +712128 1   caaggctccgggcttcggtgac
200 +712129 1   aaggctccgggcttcggtgacc
200 +712130 1   aggctccgggcttcggtgaccg
201 +34857  2   ctcaaccttggcaaggttgcgc
201 +34858  2   tcaaccttggcaaggttgcgct
201 +70058  1   gttgtaggcgcgggccagacgg
201 +70059  1   ttgtaggcgcgggccagacggg
201 +70060  1   tgtaggcgcgggccagacgggt
202 +586    5   aactgtctcacgacgttctgaa

col_1: the ordinal number of the contig as it appears in the genome fasta
col_2: the position of the kmer in the current contig
col_3: the occurrence count for the kmer in the index
col_4: the kmer sequence
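
For scripting against this output, the four columns described above can be split into typed fields. The following is a minimal sketch, not part of Tallymer itself: the names `TallymerHit` and `parse_hit` are made up for illustration, and it assumes the leading `+` on column 2 is a strand sign that should be kept separate from the numeric position.

```python
from typing import NamedTuple


class TallymerHit(NamedTuple):
    """One line of tallymer search output (hypothetical container)."""
    seqnum: int    # col_1: ordinal number of the contig in the fasta
    strand: str    # leading sign of col_2 (assumed to mark the strand)
    position: int  # col_2 without the sign: kmer position in the contig
    count: int     # col_3: occurrence count for the kmer in the index
    kmer: str      # col_4: the kmer sequence


def parse_hit(line: str) -> TallymerHit:
    """Split one whitespace-delimited output line into typed fields."""
    seqnum, signed_pos, count, kmer = line.split()
    return TallymerHit(
        seqnum=int(seqnum),
        strand=signed_pos[0],
        position=int(signed_pos[1:]),
        count=int(count),
        kmer=kmer,
    )


hit = parse_hit("201 +34857  2   ctcaaccttggcaaggttgcgc")
print(hit.seqnum, hit.position, hit.count)
```

Splitting on arbitrary whitespace handles the variable column padding seen in the output above.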

The Tallymer documentation is a bit sparse, and I am confused as to what the occurrence count (column 3) is actually telling me. From what I can see, it is NOT the number of times the kmer appears in my entire multi-record fasta file. For example, this kmer:

201 +70060 1 tgtaggcgcgggccagacgggt

occurs in my output 12 times, each time on a different contig. This led me to believe that the occurrence count is the number of times a kmer appears in that single sequence record (contig #201 in this case). But then I see odd output like this next example. This output line:

202 +592 5 ctcacgacgttctgaacccagc

would lead me to believe that the kmer appears 5 times in contig #202. But when I grep the full report, I see only a single line for that kmer on contig #202 (with a single position). Also mysterious: that exact kmer appears on 994 lines in the output, and in all 994 cases it's reported as present exactly 5 times. No other occurrence count is reported for any of the 994 instances of this kmer.
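
The kind of grep-based check described above can also be done in one pass over the report. This is only a sketch under the assumption that every data line has the four whitespace-separated columns listed earlier; `summarize_report` is a made-up helper name, not anything provided by genometools.

```python
from collections import defaultdict


def summarize_report(lines):
    """For each kmer, return (number of output lines, sorted distinct col_3 values)."""
    n_lines = defaultdict(int)
    reported = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _seqnum, _pos, count, kmer = fields
        n_lines[kmer] += 1
        reported[kmer].add(int(count))
    return {kmer: (n_lines[kmer], sorted(reported[kmer])) for kmer in n_lines}


# Two lines taken from the output discussed in this post.
example = [
    "201 +70060 1 tgtaggcgcgggccagacgggt",
    "202 +592 5 ctcacgacgttctgaacccagc",
]
for kmer, (seen, counts) in summarize_report(example).items():
    print(kmer, seen, counts)
```

A kmer whose number of output lines is far from its reported count (such as 994 lines versus a count of 5) stands out immediately in such a summary.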

So I am at a loss. Can someone familiar with Tallymer explain how I should be interpreting this output? These are bacterial genomes, and there are on the order of 2-3 million 22mers reported per genome. The program completed without error, and I am using the latest release of Tallymer (which is part of the genometools package, I have genometools v1.6.2).

tallymer kmer • 508 views

score 0 · Answer 1 · 2022-05-10

While doing further troubleshooting I discovered that I had provided the wrong fasta file when running tallymer search. The tallymer suffixerator & mkindex commands were each run on a single genome out of a collection of genomes (the UHGG database), but when I ran tallymer search, instead of feeding it the fasta for that single genome, I fed it a fasta of all genomes in the UHGG. While this did complete without error, the output was entirely wrong. When I re-ran the tallymer search step using the correct single-genome fasta, matching what was used for suffixerator & mkindex, it worked as expected. I should also point out that column 3 is actually the total number of occurrences of the given kmer across the entire fasta (and not, as I wrote above, the number of occurrences in the contig listed on that output line).
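
With correct per-genome reports, the original goal of finding 22mers unique across the collection could be sketched as below. This is an illustration only: `genome_specific_kmers` and the genome labels are hypothetical, the input is assumed to be the column-4 kmers already collected per genome, and reverse complements are not collapsed, which a real primer design workflow would likely need to handle.

```python
from collections import Counter


def genome_specific_kmers(kmers_by_genome):
    """Return, for each genome, the kmers seen in that genome and no other."""
    genomes_containing = Counter()
    for kmers in kmers_by_genome.values():
        # set() so a kmer repeated within one genome is counted once for it
        genomes_containing.update(set(kmers))
    return {
        genome: {k for k in set(kmers) if genomes_containing[k] == 1}
        for genome, kmers in kmers_by_genome.items()
    }


# Hypothetical genome labels; kmers taken from the output shown in the question.
collection = {
    "genome_A": ["cgctgctgatcatcgccgagga", "aactgtctcacgacgttctgaa"],
    "genome_B": ["aactgtctcacgacgttctgaa", "gttgtaggcgcgggccagacgg"],
}
specific = genome_specific_kmers(collection)
print(sorted(specific["genome_A"]))
```

Holding every kmer set in memory may not scale to a collection the size of UHGG at 2-3 million 22mers per genome, so this is best read as a statement of the logic rather than a production approach.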

Sorry for the confusion, tallymer is working as advertised. It's just my brain that was not :)