Question: Getting sequence id from k-mers using jellyfish
3.7 years ago, Protostome (European Union) wrote:

I'm currently extracting a list of k-mers from a FASTQ file using Jellyfish. In addition to the k-mers, I would also like, for each k-mer, a list of the sequence IDs (that is, the IDs of the MiSeq reads) in which it occurs.

Is this something Jellyfish is capable of doing? Unfortunately, I couldn't find any description of this in the docs.

If not, is there a tool that is able to perform this task?

3.7 years ago, Rob (United States) wrote:

No, neither Jellyfish nor any other standard k-mer counter of which I am aware will provide this type of information. Remembering the record where each k-mer occurred would require a huge amount of extra resources (specifically, memory) during k-mer counting. The tools that do this are those that actually build an index on the read set (which, you should be forewarned, is typically a time- and memory-intensive task). You might want to look at Gk-Arrays and BEETL. These tools build an index on a set of reads that allows you to query for a specific k-mer and get a list of all the reads in which it occurs.


Thanks, Rob. I think the best approach is to iterate over these k-mers and keep the list of reads per k-mer out of memory, on disk (SQLite is probably the easiest method).

(reply written 3.7 years ago by Protostome)
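For illustration, the SQLite idea could be sketched roughly as below. This is only a minimal sketch, not something posted in the thread: the table name `kmer_reads`, the function names, and the assumption that reads arrive as `(read_id, sequence)` pairs are all hypothetical choices. It stores one row per (k-mer, read) pair and indexes the k-mer column after loading.

```python
import sqlite3


def build_kmer_db(records, k, db_path=":memory:"):
    """Store one (kmer, read_id) row per distinct k-mer of each read.

    `records` is any iterable of (read_id, sequence) pairs (an assumption
    of this sketch). Pass a file path as `db_path` to keep the table on disk.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kmer_reads "
        "(kmer TEXT NOT NULL, read_id TEXT NOT NULL)"
    )
    for read_id, seq in records:
        # A set, so a k-mer repeated within one read is recorded only once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        conn.executemany(
            "INSERT INTO kmer_reads VALUES (?, ?)",
            ((kmer, read_id) for kmer in kmers),
        )
    # Build the index after the bulk insert so lookups by k-mer are fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_kmer ON kmer_reads (kmer)")
    conn.commit()
    return conn


def reads_with_kmer(conn, kmer):
    """Return the IDs of all reads containing `kmer`, in insertion order."""
    rows = conn.execute(
        "SELECT read_id FROM kmer_reads WHERE kmer = ? ORDER BY rowid",
        (kmer,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    toy_reads = [("r1", "ACGTA"), ("r2", "CGTAC")]
    conn = build_kmer_db(toy_reads, k=3)
    print(reads_with_kmer(conn, "CGT"))
```

Because the table holds a row for every k-mer of every read, it grows roughly with the total number of bases, so this is only practical for modest read sets.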

If you know which k-mers you're interested in ahead of time, and it's a reasonably sized set, then an approach like this would work well. You keep your set of k-mers in a hash, do a linear scan of the file, and for each k-mer of interest you encounter, maintain a list of the reads where it occurred. If you want to do this for all k-mers, then building e.g. an SQLite database should "work"; it may just end up being slow or huge. The benefit of the indices I mentioned above is that they are relatively compact w.r.t. the amount of information they contain (and the queries they can answer), so they should work well even for very large read sets. However, if your FASTQ files aren't too huge, a simpler approach should work just fine.

(reply written 3.7 years ago by Rob)
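The hash-plus-linear-scan approach described in the reply above might look something like the following. It is a rough sketch under several assumptions not stated in the thread: the FASTQ file uses exactly four lines per record, the read ID is the first whitespace-delimited token of the header, and the k-mers of interest are given as plain uppercase strings. The function names are hypothetical. If the k-mers came from canonical counting (`jellyfish count -C`), the reverse complement of each window would also need to be looked up, which this sketch does not do.

```python
from collections import defaultdict
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence) from a strict 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored
        # Drop the leading '@' and any description after the first space.
        yield header[1:].split()[0], seq


def reads_per_kmer(handle, kmers_of_interest, k):
    """Map each k-mer of interest to the list of read IDs that contain it."""
    hits = defaultdict(list)
    for read_id, seq in read_fastq(handle):
        seen = set()  # avoid listing a read twice for a repeated k-mer
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in kmers_of_interest and kmer not in seen:
                seen.add(kmer)
                hits[kmer].append(read_id)
    return hits


if __name__ == "__main__":
    example_fastq = "@r1 first\nACGTA\n+\nIIIII\n@r2\nCGTAC\n+\nIIIII\n"
    hits = reads_per_kmer(StringIO(example_fastq), {"CGT", "TAC"}, k=3)
    for kmer, read_ids in sorted(hits.items()):
        print(kmer, read_ids)
```

Memory use here scales with the number of hits for the chosen k-mers rather than with the whole read set, which is why this works best when the set of k-mers is known in advance and reasonably small.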