find unique sequences among a set of fasta entries
17 months ago

What is the best way to determine arbitrary-length uniqueness for sequences?

Let's say I have 100 DNA sequences, ranging from 300 bp to 100 kb. I want to know all the regions of each sequence that are unique among this set. The individual sequences contain significant repeat DNA, so I want to know which regions are NOT part of repeat DNA and are not found in the other sequences. I am also interested in finding the unique regions within this set that are not present anywhere in the human genome.

My first thought was to BLAST each pair of sequences and keep track of the unaligned regions, but this seemed inefficient.

alignment BLAST genome repeat • 481 views

The way you described is the way to do it, even if it seems inefficient. Since your sequences range so widely in size, other redundancy-removal methods will likely not work.

17 months ago
Mensur Dlakic ★ 20k

I suggest you try counting k-mers. Pick a decent size, say 31, and find all k-mers of that size that occur only once in the whole set. Once you have them, map them back to your sequences and look for clusters of unique k-mers, which will mark the unique regions. Or you can start with longer k-mers and not look for clusters at all.
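To make the counting step concrete, here is a minimal sketch in plain Python, not a tuned implementation. The function names `canonical` and `unique_kmers` are made up for illustration, and two assumptions are mine rather than the answer's: a k-mer and its reverse complement are counted as the same thing, and windows containing N or other ambiguity codes are skipped. For 100 sequences of up to 100 kb this fits in memory; for a whole genome a dedicated k-mer counter would be the better choice.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = frozenset("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def unique_kmers(sequences, k=31):
    """Return the canonical k-mers that occur exactly once across all sequences.

    `sequences` is a dict mapping a sequence name to its DNA string.
    A k-mer repeated inside a single sequence is counted twice,
    so it is not reported as unique.
    """
    counts = Counter()
    for seq in sequences.values():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip windows with N or other ambiguity codes.
            if _VALID.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return {kmer for kmer, n in counts.items() if n == 1}


if __name__ == "__main__":
    toy = {"a": "ACGTT", "b": "ACGAA"}
    print(sorted(unique_kmers(toy, k=3)))
```

With k=3 the toy example is only there to show the call; real use would keep k around 31 as suggested above.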


This may help:


kmercountexact.sh from BBMap suite can also be used for this.

What I am not sure of is how one maps the k-mers back to the sequences, since that is going to be a significant bookkeeping task, unless I am missing something simple.
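One possible way to keep the bookkeeping small is to skip alignment and simply slide a window along each sequence, test each k-mer for membership in the set of unique k-mers, and merge runs of consecutive hits into intervals. The sketch below assumes the unique k-mers are already loaded into a Python set (for example, parsed from a k-mer counter's output). The name `unique_intervals` and the `normalize` parameter are hypothetical. If the counter reports canonical k-mers, pass the same canonicalization function as `normalize` so the lookup matches.

```python
def unique_intervals(seq, unique_set, k=31, normalize=lambda kmer: kmer):
    """Return 0-based, half-open (start, end) intervals of `seq` that are
    covered by runs of consecutive k-mers found in `unique_set`."""
    seq = seq.upper()
    intervals = []
    start = None  # start of the current run of unique k-mers, if any
    for i in range(len(seq) - k + 1):
        if normalize(seq[i:i + k]) in unique_set:
            if start is None:
                start = i
        elif start is not None:
            # The last unique k-mer started at i - 1 and ends at i - 1 + k.
            intervals.append((start, i - 1 + k))
            start = None
    if start is not None:
        # The run reached the final window, so it extends to the sequence end.
        intervals.append((start, len(seq)))
    return intervals


if __name__ == "__main__":
    hits = unique_intervals("AAACCCGGG", {"AAC", "ACC", "GGG"}, k=3)
    print(hits)
```

This merges only strictly adjacent unique k-mers. Looking for looser "clusters" would mean allowing a small gap between runs before closing an interval, which is a judgment call about how many non-unique k-mers to tolerate.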
