Question: GATB-core kmer counting
elebanjar wrote (3 months ago):

I recently started using the GATB-core library for counting kmers in reads. Similar to the example code given in "kmer9.cpp" in the Git-Repo, I'm using SortingCountAlgorithm for counting the kmers. Now my (very basic) question: given a specific kmer sequence, is there any way to directly look up its abundance computed by the algorithm (or do I need to iterate through the computed [kmer, abundance] pairs until I find the kmer in question)? Thanks in advance!

Tags: gatb, gatb-core, kmer-counting
Rayan Chikhi (France, Lille, CNRS) wrote (3 months ago):

Hi,

Yes, it's possible in GATB, but you'd need to build a de Bruijn graph first. See this example: https://github.com/GATB/gatb-core/blob/master/gatb-core/examples/debruijn/debruijn26.cpp

Note that this mechanism cannot determine whether a k-mer is truly in the graph. GATB will return the correct abundance only if the k-mer was present in the sample the graph was constructed from.

best,

Rayan


Thank you for the quick reply, that helps already! In my setting, I don't know beforehand whether a specific kmer would be present in the reads (i.e. the graph), since I have a fixed set of kmers for which I want to know how often they occur in the reads. Using the approach you suggested, is there a way to check if a kmer sequence is present in the graph to make sure I only look up abundances for those that are actually in the graph?

Reply written 3 months ago by elebanjar

If you can tolerate some wrong answers for query k-mers, you can use GATB as-is. It will usually return the right answer, but with a small probability (which can be tuned to be arbitrarily small) GATB will report that a k-mer is present in the graph when in fact it is not.
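To illustrate the kind of one-sided error described above, here is a minimal C++ sketch of a Bloom filter over k-mer strings. It is not GATB's code: the class name `ToyBloomFilter`, the hashing scheme, and the assumption that the false positives come from a Bloom-filter-like structure are all choices made for this example. The point is only that such a structure never misses an inserted k-mer but can claim that an absent k-mer is present, and that giving it more bits makes this rarer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter over k-mer strings. Illustrative only: this is NOT GATB's
// code, and the class name and hashing scheme are assumptions of this sketch.
class ToyBloomFilter {
public:
    ToyBloomFilter(std::size_t num_bits, unsigned num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {
        assert(num_bits > 0 && num_hashes > 0);
    }

    // Set the bit at each of the k-mer's hash positions.
    void insert(const std::string& kmer) {
        for (unsigned i = 0; i < num_hashes_; ++i) {
            bits_[position(kmer, i)] = true;
        }
    }

    // false -> the k-mer was definitely never inserted.
    // true  -> the k-mer was probably inserted, but this can be a false
    //          positive if other k-mers happened to set the same bits.
    bool possibly_contains(const std::string& kmer) const {
        for (unsigned i = 0; i < num_hashes_; ++i) {
            if (!bits_[position(kmer, i)]) return false;
        }
        return true;
    }

private:
    // 64-bit FNV-1a, used as a second hash independent of std::hash.
    static std::uint64_t fnv1a(const std::string& s) {
        std::uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod number of bits.
    std::size_t position(const std::string& kmer, unsigned i) const {
        const std::uint64_t h1 = std::hash<std::string>{}(kmer);
        const std::uint64_t h2 = fnv1a(kmer) | 1ULL;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    unsigned num_hashes_;
};
```

With a deliberately undersized filter (for example a single bit), any query returns "present" once one k-mer has been inserted, which is the failure mode in its most extreme form. A realistically sized filter makes it rare but never impossible.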

If you need an exact answer for each query (i.e. cannot tolerate any mistake): unfortunately, GATB is designed to be memory-efficient, so we didn't implement exact graph membership queries, because doing so would make it significantly more memory-intensive. I can recommend an alternative: constructing a hash table of all k-mers, using e.g. Jellyfish, see https://github.com/gmarcais/Jellyfish/tree/master/examples/jf_count_dump
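The hash-table idea can be sketched in a few lines of plain C++. This is neither Jellyfish's implementation nor its API: the function names, the use of `std::unordered_map`, and the choice to count canonical k-mers (a k-mer and its reverse complement share one entry) are assumptions for illustration, and the input is assumed to be uppercase A/C/G/T. It shows why the lookup is exact: a k-mer that never occurred simply has no entry, so its abundance is 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names for this sketch; not part of Jellyfish or GATB.
using KmerCounts = std::unordered_map<std::string, std::uint32_t>;

// Reverse complement of a DNA string (characters other than A/C/G/T are kept).
std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;
        }
    }
    return rc;
}

// Canonical form: the lexicographically smaller of a k-mer and its
// reverse complement, so both strands map to the same table entry.
std::string canonical(const std::string& kmer) {
    const std::string rc = reverse_complement(kmer);
    return std::min(kmer, rc);
}

// Count every k-mer of every read in an exact hash table.
KmerCounts count_kmers(const std::vector<std::string>& reads, std::size_t k) {
    assert(k > 0);
    KmerCounts counts;
    for (const std::string& read : reads) {
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            ++counts[canonical(read.substr(i, k))];
        }
    }
    return counts;
}

// Exact lookup: returns 0 for a k-mer that never occurred in the reads.
std::uint32_t abundance(const KmerCounts& counts, const std::string& kmer) {
    const auto it = counts.find(canonical(kmer));
    return it == counts.end() ? 0 : it->second;
}
```

For a fixed set of query k-mers, one would build the table once with `count_kmers` and then call `abundance` per query. The cost is memory: every distinct k-mer is stored explicitly, which is the trade-off mentioned above.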

Reply written 3 months ago by Rayan Chikhi

Ok, I see. Indeed the Jellyfish approach you suggested was exactly what I was looking for. Thanks again for your help!

Reply written 3 months ago by elebanjar