K-mers using Jellyfish
4
1
7.5 years ago

Hi,

I am trying to find all k-mers in my FASTA files. For example, if the sequence is AAATTCCGGGGGAAAA and I want all k-mers with k=3, I expect it to return 64 values (one for each of the 4^3 possible 3-mers). I could write a simple Python script to do this, but I am dealing with a lot of data and wanted to use Jellyfish. It works well, but it doesn't include and evaluate all possible combinations. I am interested in k-mers of size 1-5. Does anyone know how to do this?

Anjali

k-mers jellyfish counters sequence • 4.8k views
0

Thanks Rob. Eventually I just used a simple Python script :)

0

Just to clarify: I think it should not return 64 k-mers. For that sequence it should return only the following:

GGG     3
AAA     3
GAA     1
GGA     1
CCG     1
TTC     1
AAT     1
CGG     1
TCC     1
ATT     1
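
For what it's worth, the table above can be reproduced with a few lines of Python. This is only a minimal sketch of a sliding-window count; the function name `count_kmers` is made up for illustration, and it lists only the k-mers that actually occur in the sequence (no zero entries).

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every k-mer that occurs in seq using a sliding window."""
    seq = seq.upper()
    # One window per start position; a sequence of length n has n - k + 1 windows.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("AAATTCCGGGGGAAAA", 3)
for kmer, n in counts.most_common():
    print(kmer, n, sep="\t")
```

The sequence has 16 bases, so there are 14 windows of length 3, which fall into 10 distinct 3-mers.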

3
7.5 years ago
Rob 5.0k

Hi Anjali,

Jellyfish (and most k-mer counters I know of, for that matter) only returns results for k-mers that are actually present in the data. If there are only M distinct k-mers in your data, the counts of the other 4^k - M k-mers are implicitly 0. This keeps the size of intermediate files manageable: for larger k-mer sizes most data is very sparse, and reporting every absent k-mer with a count of 0 would waste an enormous amount of space.

Jellyfish and other k-mer counters are designed for speed and will scale well to large files and large k-mer sizes. However, if you're only interested in 1-5-mers, a simple solution with a direct lookup table (an array) mapping the k-mer id to an atomic integer count should be fairly fast. If implemented in C/C++ with multiple threads, it may even be faster than some existing counters, since it's more specific in scope.

Of course, you could always just run Jellyfish for these values and write a simple Python script to expand its resulting file format (which lists only the k-mers that are present) into a format listing the results for all 4^k k-mers. That should also be sufficiently fast, and should be somewhat simpler to set up.
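
A rough sketch of that last option, in case it helps. It assumes the Jellyfish counts have already been dumped as plain text with one `KMER COUNT` pair per line (the column format of `jellyfish dump -c`); the function name `expand_counts` and the example lines are hypothetical, not part of Jellyfish.

```python
from itertools import product

def expand_counts(dump_lines, k, alphabet="ACGT"):
    """Return {kmer: count} for all len(alphabet)**k k-mers, using 0 for absent ones."""
    observed = {}
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        observed[kmer.upper()] = int(count)

    full = {}
    # product() enumerates every possible k-mer in lexicographic order.
    for letters in product(alphabet, repeat=k):
        kmer = "".join(letters)
        full[kmer] = observed.get(kmer, 0)
    return full

# Hypothetical dump content, made up for illustration.
example_dump = ["AAA 3", "GGG 3", "AAT 1"]
table = expand_counts(example_dump, 3)
for kmer, n in table.items():
    print(kmer, n, sep="\t")
```

To use it on a real file, pass the open file handle instead of the list, e.g. `expand_counts(open("counts.txt"), 3)`, where `counts.txt` is whatever name you gave the dump. Note that if Jellyfish was run with canonical counting (`-C`), a k-mer and its reverse complement share one entry, so the expanded table would need to take that into account.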

--Rob

1
7.5 years ago
edrezen ▴ 720

Hi,

Another suggestion is to use DSK (a k-mer counter with a low memory footprint). It now uses HDF5 as its output format, so you can use HDF5 tools to extract the k-mer counts. You can find several examples in the README file. It also provides a dsk2ascii binary that dumps [kmer, count] pairs.

Erwan

0
7.5 years ago
Prakki Rama ★ 2.5k

You can also count k-mers using EMBOSS wordcount:

wordcount test.fa -wordsize=3 test.out


~Prakki Rama.

0
4.4 years ago
SmallChess ▴ 560

Just use DSK. The software can take a single FASTA file.

dsk -abundance-min 0 -file A.V.10.fa -out ABCD
dsk2ascii -file ABCD.h5 -out ABCD.txt


Remember to set the minimum abundance to zero.