Question

How to find number of unique n-mers in genome?

0

Entering edit mode

8.0 years ago

endrebak ▴ 960

I want to find the number of unique n-mers in the genome, for arbitrary ns (say 30). Is there software that can do this?

If not, how could I do it without requiring excessive ram? I guess a suffix-tree would be the best option, but all the implementations I have found are either inefficient or 10 years old.

alignment sequence • 2.1k views

ADD COMMENT • link updated 8.0 years ago by John 13k • written 8.0 years ago by endrebak ▴ 960

2

Entering edit mode

Jellyfish or kmc2? Not sure if they work with chromosome-long sequences.

ADD REPLY • link 8.0 years ago by lh3 33k

0

Entering edit mode

I want to thank all responders and will upvote on monday when I get to my computer. I'll start with jellyfish since 2.0 accepts full genomes according to the docs ( Section 1.1.2 in the user guide: http://www.genome.umd.edu/docs/JellyfishUserGuide.pdf )

ADD REPLY • link 8.0 years ago by endrebak ▴ 960

0

Entering edit mode

kmercountexact.sh from BBMap may be worth a try. Some prior discussion: How to find the shortest k-mer length that is unique in a large genome.

ADD REPLY • link 8.0 years ago by GenoMax 141k

score 1 · Answer 1 · 2016-04-09

A suffix array might help. Compute SA and LCP. Then do a linear scan of the LCP array for each entry i with LCP[i] < n && LCP[i+1] < n you have found a unique n-mer (given strlen(S+SA[i]) >= n). For SA construction you might want to use https://github.com/y-256/libdivsufsort

Alternatively you could apply hashing to your kmers. There are a lot of tools out there for that job. See if one fits your needs.