Question: Indexing the genome in hash tables
5.6 years ago by
United States
wangfx0 wrote:


I'm wondering how the designers of software such as GSNAP, which builds hash tables of the genome for alignment, account for things like long repeated sequences in the genome.


rna-seq alignment • 2.7k views
5.6 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Typically, a kmer is the hash key and the value is the list of reference locations at which it occurs. So, a kmer from a highly repetitive sequence will map to a long list. The longer a kmer's list of locations, the less information it carries for a given genome, so extremely long lists are both uninformative and slow to process, and thus may be discarded.
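
As a rough illustration of that idea (not the code of GSNAP or any particular mapper), here is a minimal Python sketch of a kmer index that drops overly long location lists. The function name `build_kmer_index`, the `max_occurrences` cutoff, and the toy reference string are all made up for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k, max_occurrences):
    """Map each kmer to the list of reference positions where it starts.

    Kmers occurring more than max_occurrences times are treated as
    repetitive and left out of the index (an assumed, simple policy).
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    # Discard kmers whose location lists are too long to be informative.
    return {kmer: locs for kmer, locs in index.items()
            if len(locs) <= max_occurrences}

# Toy reference: "ACGT" occurs three times, so it is dropped at a cutoff of 2.
reference = "ACGTACGTACGTTTGACCA"
index = build_kmer_index(reference, k=4, max_occurrences=2)
print(index.get("ACGT"))  # None, because the kmer was too repetitive
print(index["CGTA"])      # [1, 5]
```

Real mappers usually pack kmers into integers and store positions in flat arrays rather than Python lists, but the key-to-location-list structure is the same.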

Kmer-based mappers that use long kmers, like SNAP and (I suspect) ISAAC, may simply look at every location containing a kmer match, because there will not be very many matches.  Mappers that use short kmers, such as BBMap, look for areas in which multiple kmers from the same read occur near each other, because short kmers will have many more matches throughout the genome.
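
To make the short-kmer strategy concrete, here is a minimal sketch in which each kmer hit votes for the reference position where the read would have to start, and only positions supported by several kmers are kept. This is a generic seed-voting scheme written for illustration, not BBMap's actual algorithm; the names `build_index`, `candidate_starts`, and `min_hits` are hypothetical, and exact-offset voting like this ignores indels.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Map each kmer to every reference position where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_starts(read, index, k, min_hits):
    """Return (start, hits) pairs for read starts supported by >= min_hits kmers."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            # A kmer at read offset `offset` matching reference position `pos`
            # implies the read begins at pos - offset.
            votes[pos - offset] += 1
    return sorted((start, n) for start, n in votes.items() if n >= min_hits)

reference = "ACGTTGCAACGTAGGCTA"
read = "ACGTAGG"  # true origin is position 8; its first kmers also occur at 0
index = build_index(reference, k=3)
print(candidate_starts(read, index, k=3, min_hits=1))  # [(0, 2), (8, 5)]
print(candidate_starts(read, index, k=3, min_hits=3))  # [(8, 5)]
```

With `min_hits=1` the spurious location at position 0 survives because two short kmers match there by chance; requiring several agreeing kmers leaves only the true location.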
