Question

Hash Size Vs Read Identity, Mosaik Or Other NGS Mapping Tools

13.4 years ago

Rm 8.3k

I am working on mapping Illumina reads using Mosaik. From the manuals and the literature, I learnt that "hash size" plays a role in the efficiency of mapping to the reference, as well as in the computational time of the run.

What I learnt (correct me if I am wrong): larger hash size --> less efficient mapping and shorter run time.

But I am now trying to standardise the hash size for mapping to the Drosophila genome, with no mismatches allowed and complete (100%) alignment of reads to the reference. I am getting far more reads mapped to the reference at the larger hash size (hs 17) than at the smaller ones (hs 16 down to hs 11), and it runs many times faster.

I am a little sceptical of this observation. Does not allowing any mismatches increase mapping efficiency? And does the number of hash positions per seed also have an effect? I would appreciate some light on this topic.

next-gen sequencing mapping • 3.4k views

updated 13.4 years ago by Mrawlins ▴ 430 • written 13.4 years ago by Rm 8.3k

score 3 · Answer 1 · 2010-11-22

It seems that the number associated with hs is not the size of the hash per se, but the size of the nucleotide oligomer that gets put into the hash. So hs 17 will be a smaller hash of 17-bp sequences than hs 11 (11-bp sequences). The longer sequences will result in fewer random hits, but a sequence that's too long is likely to contain enough sequencing errors to give bad matches. That's what it seems like from their description.
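To make the "oligomer that gets put into the hash" idea concrete, here is a minimal Python sketch. It is not MOSAIK's implementation: the function names `build_kmer_index` and `mean_positions_per_kmer` are made up for illustration, and the "genome" is a short random string. It only shows how the seed length k changes the number of distinct keys and the number of reference positions stored under each key.

```python
import random
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def mean_positions_per_kmer(index):
    """Average number of reference positions stored under each k-mer key."""
    return sum(len(positions) for positions in index.values()) / len(index)

# Toy random "genome" (an assumption for illustration, not real sequence data).
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(20000))

for k in (4, 6, 8, 11):
    idx = build_kmer_index(reference, k)
    print(k, len(idx), round(mean_positions_per_kmer(idx), 2))
```

On a random sequence like this, short k-mers each point at many positions, while longer k-mers tend towards one position per key. A real genome has repeats, so the trend should be similar but less clean.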

Disallowing mismatches increases the algorithm's speed and memory efficiency. Thus it can consider more candidates in the same amount of time. The idea behind most of these algorithms is to look for an exact match (or near-exact match) of a "seed" sequence, then extend that seed. A fixed seed size is used, which seems to be what the hs parameter refers to. If the seed is short enough that mismatches are not expected (based on the sequencer's error rate and the mutation rate of the sample NA) then disallowing seed mismatches speeds things up with no effect on quality.
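The seed-then-extend idea with no mismatches allowed can be sketched roughly as follows. This is a simplified illustration in Python, not how MOSAIK actually does it: `build_kmer_index` and `map_read_exact` are hypothetical names, only the first k bases of the read are used as the seed, and the "extension" is a plain string comparison.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read_exact(read, reference, index, k):
    """Return sorted reference start positions where the whole read matches exactly."""
    hits = set()
    seed = read[:k]  # simplification: only the first k bases act as the seed
    for seed_pos in index.get(seed, []):
        # Extension step: with no mismatches allowed this is a direct comparison.
        if reference[seed_pos:seed_pos + len(read)] == read:
            hits.add(seed_pos)
    return sorted(hits)

reference = "TTACGGATCCATGACGGATCAA"  # toy reference, made up for this example
k = 4
index = build_kmer_index(reference, k)

# The seed "ACGG" occurs twice, so two candidates are tried during extension.
print(index["ACGG"])
print(map_read_exact("ACGGATCC", reference, index, k))
```

Every position stored under the seed costs one extension attempt, which is why a seed that occurs in fewer places means less work per read.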

In your case, I would expect a 17-bp hash size to run faster than 11-bp, as there will be fewer potential hits for any given sequence (i.e. each sequence hashed is more likely to be unique). I would also expect it to be more accurate, since fewer random matches will be tried out and the probability of false positive/negative results decreases. I expect that trend to continue until the hash size is so large that the probability of a mutation or sequencing error becomes large, at which point it will still speed up, but lose accuracy.
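The trade-off described above can be put into rough numbers with a back-of-envelope model. The sketch below assumes a random genome with uniform base composition and independent per-base sequencing errors, which real data will not satisfy exactly. The genome size and error rate are placeholder assumptions, not measured values, and the function names are invented for this example.

```python
def expected_random_hits(genome_size, k):
    """Expected occurrences of one given k-mer in a random genome with uniform bases."""
    return genome_size / 4 ** k

def prob_seed_error_free(error_rate, k):
    """Probability that all k bases of a seed are error-free, assuming independent errors."""
    return (1 - error_rate) ** k

GENOME_SIZE = 120_000_000  # assumed order of magnitude for illustration only
ERROR_RATE = 0.01          # assumed per-base error rate for illustration only

for k in (11, 13, 15, 17, 25, 35):
    hits = expected_random_hits(GENOME_SIZE, k)
    clean = prob_seed_error_free(ERROR_RATE, k)
    print(f"k={k:2d}  expected random hits={hits:.4g}  P(seed error-free)={clean:.3f}")
```

Under these assumptions the number of chance hits falls by a factor of 4 with each extra base, while the chance that the seed itself is error-free declines only slowly, which fits the expectation that a moderate increase in hash size helps before very long seeds start to lose reads.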

For information, computer scientists define the term hash size in a way that is completely different from how it is used here.