Create smallest set of k-mers using degenerate bases
1
0
Entering edit mode
8 months ago
albert • 0

I have a set of k-mers, all of equal length (eg. 10bp). They currently only contain the letters A, G, T and C. I am looking for a tool or script that can find the smallest set of k-mers that encapsulates all diversity in the input set using degenerate bases.

I am aware that there are tools such as DegePrime for identifying degenerate primers; however, I do not believe they cleanly fit this purpose as they expect multisequence alignments of long regions for amplification.

Example:

Input:

AAAAAAAAAA
TAAAAAAAAA
GCGAAAAAAA


Output:

WAAAAAAAAA
GCGAAAAAAA


Requirements:

• Must scale to a large number of k-mers, eg. over 10,000
• Can use any IUPAC ambiguous base to collapse a single position
• Cannot add any new degeneracy (ie the sequences cannot expand to a sequence not in the input)
• Must optimize for smallest set size

Thanks!

Edit: As there seems to be some confusion around the requirements, I've created some additional example cases (note: they may not be perfect, eg. I may have missed a solution, so feel free to correct any of them).

# Example 1
AAAAAAAAAA
TAAAAAAAAA

# Example 2: add in an additional seq that cannot be collapsed
AAAAAAAAAA
TAAAAAAAAA
GCGAAAAAAA

# Example 3: single seq
AAAAAAAAAA

# Example 4: very similar to example 1 but uses an N to capture all 4 options for the first base
AAAAAAAAAA
TAAAAAAAAA
CAAAAAAAAA
GAAAAAAAAA

# Example 5: all permutations are OK
AAAAAAAAAA
TAAAAAAAAA
AAAAAAAAAA
ATAAAAAAAA

# Example 6: All permutations not OK so keep some seqs separate, minimizing for set size (similar to example 4)
AAAAAAAAAA
TAAAAAAAAA
CAAAAAAAAA
GAAAAAAAAA
TACAGATACA
AACAGAAAAA

# Example 7: Multiple optimal solutions
AAAAAAAAAA
TAAAAAAAAA
ACAAAAAAAA
AGAAAAAAAA

# Output 1
WAAAAAAAAA

# Output 2
WAAAAAAAAA
GCGAAAAAAA

# Output 3
AAAAAAAAAA

# Output 4
NAAAAAAAAA

# Output 5
WWAAAAAAAA

# Output 6
NAAAAAAAAA
TACAGATACA
AACAGAAAAA

# Output 7
## Option 1
WAAAAAAAAA
ASAAAAAAAA
## Option 2
AVAAAAAAAA
TAAAAAAAAA

kmer degenerate • 820 views
2
Entering edit mode

The smallest set is: NNNNNNNNNN

0
Entering edit mode

may be N{10}

0
Entering edit mode

These answers do not meet the requirement that the resulting set cannot add additional/new degeneracy. NNNNNNNNNN could be expanded to CCCCCCCCCC which is not in the original set, therefore this is not the correct solution.

0
Entering edit mode

What is the difference between this:

WAAAAAAAAA
GCGAAAAAAA


and this:

AAAAAAAAAA
KMRAAAAAAA


I'm unsure if there is a one-to-one mapping between your starting sequences and an encoded output.

0
Entering edit mode

KMRAAAAAAA could be rewritten as {G/T}{A/C}{A/G}AAAAAAA. Expanding all permutations of this reveals:

GAAAAAAAAA
GCAAAAAAAA
GCGAAAAAAA
GAGAAAAAAA
TAAAAAAAAA
TCAAAAAAAA
etc.


Not all of these k-mers are contained in the input set and therefore this is not an acceptable solution. Reversing this optimization must not add any new sequences.

It is possible that there will be multiple optimal solutions (ie minimal sets of equal size), but they can only be expanded to the input set with no additional sequences.

1
Entering edit mode
8 months ago

Interesting problem and I took a run at it here:

https://github.com/alexpreynolds/kmer-collapse

I would be curious to know if it scales to the size of your input set. Hope this is useful to you, in any case.