Indexing the genome in hash tables
8.8 years ago
wangfx • 0

Hello,

I'm wondering how the designers of software such as GSNAP, which creates hash tables of the genome for alignment, take into account things like long repeated sequences in the genome.

Thanks!

alignment RNA-Seq • 3.4k views
8.8 years ago

Typically, a kmer is a hash key and the value is the list of reference locations at which it occurs. So, a kmer from a highly repetitive sequence will map to a long list. The longer the list of locations for a kmer, the less information it carries about where a read belongs in a given genome. Extremely long lists are both uninformative and slow to process, so they may be discarded.
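
To make the idea concrete, here is a minimal Python sketch of such an index. It is only an illustration, not how GSNAP or any particular mapper implements it; the function name `build_kmer_index`, the `max_hits` cutoff, and the toy reference string are all made up for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k, max_hits):
    """Map each kmer to the list of reference positions where it starts.

    Hypothetical helper for illustration: kmers with more than
    max_hits locations are treated as repetitive and discarded.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    # Keep only kmers whose location lists are short enough to be useful.
    return {kmer: locs for kmer, locs in index.items() if len(locs) <= max_hits}

# Toy reference containing a short tandem repeat (ACGT x 3).
reference = "ACGTACGTACGTTTGCA"
index = build_kmer_index(reference, k=4, max_hits=2)

print("ACGT" in index)  # False: it occurs 3 times, so it was discarded
print(index["CGTA"])    # [1, 5]
print(index["TGCA"])    # [13]
```

In a real genome the cutoff would be far larger and the table would be stored much more compactly, but the trade-off is the same: the repetitive kmer costs the most to store and look up, and tells you the least.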

Kmer-based mappers that use long kmers, like SNAP and (I suspect) ISAAC, may simply look at every location containing a kmer match, because there will not be very many matches. Mappers that use short kmers, such as BBMap, look for areas in which multiple kmers from the same read occur near each other, because short kmers will have many more matches throughout the genome.
