Question

K-mers using JEllyfish

1

Entering edit mode

10.2 years ago

itsthegodzgarbage ▴ 10

Hi,

I am trying to find all k-mers in my fasta files. For eg if the sequence is AAATTCCGGGGGAAAA , if I want all k-mers with k=3 , it should return 64 values from what I expect. I can write a simple python script to do this but I am dealing with lot of data and wanted to use Jellyfish. It works well but doesnt include and evaluate all possible combinations. I am interested in k-mers of size 1-5. Would anyone know about this?

Anjali

k-mers jellyfish counters sequence • 7.2k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by itsthegodzgarbage ▴ 10

0

Entering edit mode

Thanks Rob. Eventually I just used a simple python script:)

ADD REPLY • link 10.2 years ago by itsthegodzgarbage ▴ 10

0

Entering edit mode

Just for clarification. I think it should not return 64 kmers. It should return the following.

GGG     3
AAA     3
GAA     1
GGA     1
CCG     1
TTC     1
AAT     1
CGG     1
TCC     1
ATT     1

ADD REPLY • link updated 4.6 years ago by Ram 44k • written 10.2 years ago by Prakki Rama ★ 2.7k

Ram · Answer 1 · 2014-06-10

Hi Anjali,

Jellyfish (and most k-mer counters I know of for that matter), only return results for k-mers that are actually present in the data. If there are only M distinct k-mers in your data, the counts of all other 4^k - M k-mers are implicitly 0. This helps keep the size of intermediate files manageable, since, for larger k-mer sizes most data is very sparse and reporting all absent k-mers with a count of 0 would waste an enormous amount of space. While Jellyfish and other k-mer counters are designed for speed, and will scale well to large files and large k-mer sizes, if you're only interested in 1-5-mers a simple solution with a direct lookup table (array) mapping the k-mer id to an atomic integer of counts should be fairly fast (if implemented in C/C++ with multiple threads, it may even be faster than some existing counters since it's more specific in scope). Of course, you could always just run Jellyfish for these values and create a simple Python script to expand it's resulting file format (which lists only k-mers that are present) into a format listing the results for all 4^k k-mers. That should also be sufficiently fast, and should be somewhat simpler to set up.

--Rob

Ram · Answer 2 · 2014-06-17

1

Entering edit mode

10.2 years ago

edrezen ▴ 730

Hi,

Another suggestion is to use DSK (a k-mer counter with low memory footprint). It now uses HDF5 as output format, so you can use HDF5 tools to extract information of the kmers counts. You can find several examples in the README file. It also provides a dsk2ascii binary that dumps couples [kmer,count].

Erwan

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by edrezen ▴ 730

Ram · Answer 3 · 2014-06-17

0

Entering edit mode

10.2 years ago

Prakki Rama ★ 2.7k

You can also count kmers using EMBOSS wordcount.

wordcount test.fa -wordsize=3 test.out

~Prakki Rama.

ADD COMMENT • link updated 4.6 years ago by Ram 44k • written 10.2 years ago by Prakki Rama ★ 2.7k

score 0 · Answer 4 · 2017-07-05

0

Entering edit mode

7.2 years ago

scchess ▴ 640

Just use DSK. The software can take a single FASTA file.

dsk -abundance-min 0 -file A.V.10.fa -out ABCD
dsk2ascii -file ABCD.h5 -out ABCD.txt

Remember to specify the minimum abundance to zero.

ADD COMMENT • link 7.2 years ago by scchess ▴ 640