Question: K-mers using Jellyfish
itsthegodzgarbage (Ireland) wrote, 5.0 years ago:

Hi,

I am trying to find all k-mers in my FASTA files. For example, if the sequence is AAATTCCGGGGGAAAA and I want all k-mers with k=3, I expect it to return 64 values (4^3). I could write a simple Python script to do this, but I am dealing with a lot of data and wanted to use Jellyfish. It works well, but it doesn't include and evaluate all possible combinations. I am interested in k-mers of size 1-5. Would anyone know how to do this?

Anjali

(written 5.0 years ago by itsthegodzgarbage; modified 23 months ago by SmallChess)

Just for clarification: I think it should not return 64 k-mers. For the example sequence it should return only the 10 distinct 3-mers that actually occur, with their counts:

GGG     3
AAA     3
GAA     1
GGA     1
CCG     1
TTC     1
AAT     1
CGG     1
TCC     1
ATT     1
(comment written 5.0 years ago by Prakki Rama)
Rob (United States) wrote, 5.0 years ago:

Hi Anjali,

Jellyfish (and most k-mer counters I know of, for that matter) only returns results for k-mers that are actually present in the data. If there are only M distinct k-mers in your data, the counts of the other 4^k - M k-mers are implicitly 0. This keeps the size of intermediate files manageable: for larger k-mer sizes most data is *very* sparse, and reporting every absent k-mer with a count of 0 would waste an enormous amount of space.

Jellyfish and other k-mer counters are designed for speed and scale well to large files and large k-mer sizes. But if you're only interested in 1- to 5-mers, a simple solution with a direct lookup table (an array) mapping each k-mer id to an atomic integer count should be fairly fast. Implemented in C/C++ with multiple threads, it may even be faster than some existing counters, since it's more specific in scope.

Of course, you could always just run Jellyfish for these values of k and write a simple Python script to expand its output (which lists only k-mers that are present) into a format listing the results for all 4^k k-mers. That should also be sufficiently fast, and somewhat simpler to set up.
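The expansion step could look roughly like the sketch below. It assumes the counts have already been dumped to plain two-column text, one "KMER COUNT" pair per line (the column format of `jellyfish dump -c` is believed to look like this, but check your version). The function name `expand_counts` is made up for illustration. Note that if the counting was done in canonical mode, a k-mer and its reverse complement share one entry, so the non-canonical form would show up as 0 here.

```python
from itertools import product


def expand_counts(dump_lines, k, alphabet="ACGT"):
    """Return a (kmer, count) pair for every possible k-mer of length k.

    dump_lines: iterable of "KMER COUNT" text lines (assumed input format).
    K-mers that do not appear in dump_lines get a count of 0.
    """
    observed = {}
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        observed[kmer.upper()] = int(count)

    # Enumerate all len(alphabet)**k k-mers in lexicographic order.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return [(kmer, observed.get(kmer, 0)) for kmer in all_kmers]
```

To use it on a file, pass the open file handle as `dump_lines` and write each returned pair back out, one per line.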

--Rob

 

edrezen (France) wrote, 5.0 years ago:

Hi,

Another suggestion is to use DSK (a k-mer counter with a low memory footprint). It now uses HDF5 as its output format, so you can use HDF5 tools to extract the k-mer counts. You can find several examples in the README file. It also provides a dsk2ascii binary that dumps [k-mer, count] pairs.

Erwan

 

itsthegodzgarbage (Ireland) wrote, 5.0 years ago:

Thanks Rob. In the end I just used a simple Python script :)
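The script itself was not posted. As a rough idea of what such a script might look like, here is a minimal single-threaded sketch of the direct lookup table approach described in Rob's answer: each k-mer is encoded as a base-4 integer and used as an index into a flat array of 4^k counters, so absent k-mers naturally come out as 0. The names `CODE` and `count_kmers` are hypothetical, and skipping windows that contain non-ACGT characters (such as N) is an assumption about the desired behaviour.

```python
from itertools import product

# 2-bit encoding of each base; A < C < G < T matches lexicographic order.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every one of the 4**k possible k-mers in seq.

    Uses a flat table indexed by the base-4 encoding of the k-mer.
    Windows containing a non-ACGT character are skipped (assumption).
    """
    table = [0] * (4 ** k)
    mask = 4 ** k - 1  # keeps only the last k bases of the rolling index
    idx = 0
    valid = 0  # number of consecutive ACGT bases ending at this position
    for base in seq.upper():
        code = CODE.get(base)
        if code is None:
            idx = 0
            valid = 0
            continue
        idx = ((idx << 2) | code) & mask
        valid += 1
        if valid >= k:
            table[idx] += 1

    # product() yields k-mers in the same order as their table indices.
    return {"".join(p): table[i] for i, p in enumerate(product("ACGT", repeat=k))}
```

For k-mers of size 1-5, call `count_kmers(seq, k)` for each `k` in `range(1, 6)` on every sequence read from the FASTA file and sum the tables.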

Prakki Rama (Singapore) wrote, 5.0 years ago:

You can also count k-mers using EMBOSS wordcount:

wordcount test.fa -wordsize=3 test.out

~Prakki Rama.

 

SmallChess (Australia) wrote, 23 months ago:

Just use DSK. The software can take a single FASTA file.

dsk -abundance-min 0 -file A.V.10.fa -out ABCD
dsk2ascii -file ABCD.h5 -out ABCD.txt

Remember to set the minimum abundance to zero.
