Question

Indexing the genome in hash tables

8.8 years ago

wangfx • 0

Hello,

I'm wondering how the designers of software such as GSNAP, which builds hash tables of the genome for alignment, account for things like long repeated sequences in the genome.

Thanks!

alignment RNA-Seq • 3.4k views

updated 18 months ago by Ram 43k • written 8.8 years ago by wangfx • 0

Ram · Answer 1 · 2015-07-27

Typically, each kmer is a hash key, and its value is the list of reference locations at which that kmer occurs. A kmer from a highly repetitive sequence will therefore point to a long list. The longer a kmer's list of locations, the less information it carries about where a read belongs in a given genome. Extremely long lists are both uninformative and slow to process, so they may be discarded.
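
As a rough illustration of that idea, here is a minimal Python sketch. It is not how GSNAP or any particular mapper is implemented; the function name `build_kmer_index` and the `max_hits` cutoff are made up for this example, and real tools use far more compact data structures than a Python dict.

```python
from collections import defaultdict


def build_kmer_index(reference, k, max_hits):
    """Map each k-mer to the list of 0-based positions where it starts.

    k-mers that occur more than max_hits times are dropped, on the
    assumption that they come from repeats (max_hits is a hypothetical
    cutoff chosen for illustration).
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    # Keep only the k-mers whose location lists are short enough to be useful.
    return {kmer: hits for kmer, hits in index.items() if len(hits) <= max_hits}


# Toy reference in which "ACGT" is repeated three times.
reference = "ACGTACGTACGTTTGCA"
index = build_kmer_index(reference, k=4, max_hits=2)

print("ACGT" in index)  # False: three occurrences, over the cutoff
print(index["CGTA"])    # [1, 5]
print(index["TGCA"])    # [13]
```

Raising `max_hits` keeps more repetitive kmers at the cost of longer lists to scan per lookup.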

Kmer-based mappers that use long kmers, like SNAP and (I suspect) ISAAC, may simply look at every location containing a kmer match, because there will not be very many matches. Mappers that use short kmers, such as BBMap, look for areas in which multiple kmers from the same read occur near each other, because short kmers will have many more matches throughout the genome.
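
To make the short-kmer strategy concrete, here is a hedged Python sketch of one simple way to find places where several kmers from the same read agree. Each hit votes for the reference position where the read would have to start, and only positions with enough votes are kept. This is a simplification written for this answer, not BBMap's actual algorithm: it ignores indels, mismatches inside a kmer and the reverse strand, and the names `candidate_locations` and `min_seeds` are hypothetical.

```python
from collections import Counter, defaultdict


def candidate_locations(read, index, k, min_seeds):
    """Return implied read start positions supported by >= min_seeds k-mers."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            # A hit at ref_pos for the k-mer at this read offset implies
            # the read starts at ref_pos - offset (assuming no indels).
            votes[ref_pos - offset] += 1
    return sorted(start for start, n in votes.items() if n >= min_seeds)


# Toy index built inline so the example is self-contained.
reference = "GATTACAGATTTCA"
k = 3
index = defaultdict(list)
for pos in range(len(reference) - k + 1):
    index[reference[pos:pos + k]].append(pos)

read = "ATTACA"
print(candidate_locations(read, index, k, min_seeds=2))  # [1]
print(candidate_locations(read, index, k, min_seeds=1))  # [1, 8]
```

Here the kmer `ATT` alone matches at two places, but only the location at position 1 is backed by the other kmers of the read, so the stray hit at position 8 is filtered out.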