Why searching using suffix array is so slow in human genome?
1
0
Entering edit mode
8.6 years ago
chhylp123 • 0

Hi all,

I am trying to locate patterns using suffix array. Thus, I have implemented a plaint suffix array, which searches patterns exploiting binary search. In small reference genomes like caenorhabditis elegans, it ran fast. However, it was extremely slow in human genome. Searching a pattern of length 20 needs more than 1 minutes.

I know binary search in large genomes will incur more cache misses, but I think one minute per pattern is abnormal. Please help me to solve this problem, or introduce an implement of suffix array that is suited for large genome.

Thanks

genome sequence software-error alignment • 1.5k views
ADD COMMENT
0
Entering edit mode
8.6 years ago

We can't know specifically how a custom algorithm is working. You could have any kind of bug.

But There's also good reason for this kind of search to be slow on the human genome. So maybe everything is working correctly.

The genome is very large, It should be slower than a small genome. You could test how the speed scales with this parameter and see it is non-linear.

The human genome also contains a lot of 'pathological' sequence for this task. Repetitive sections will cause your index to have unbalanced properties. Consider a centromere segment of two hundred or more "A" bases in a row. How does your algorithm handle that?

You might want to operate on a 'repeat masked' version rather than the whole build.

ADD COMMENT

Login before adding your answer.

Traffic: 2275 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6