Find Substring of Trie Keys
1
0
Entering edit mode
8.9 years ago
malonge11 • 0

This seems like it should be a common problem yet I can't seem to find anything that exactly fits my needs.

I have 2 sequence(fasta) files. One is an assembly of the other, so I would like to make a trie or suffix tree out of the assembled sequences, and search each for the sub sequences.

For example:

s1 = 'ATTCCG'
s2 = 'ATT'
s3 = 'CCG'

I would like to use s1 as a key in a trie, and search it for substrings s2 and s3.

Here is what I have tried so far:

1. Bio pythons trie and triefind:

This only allows for the whole key to match, or for a prefix to match. So in the above example, I could only successfully search s2 but not s3.

2. Suffix tree for python.

I looked at numbers 2 and 4 from this but neither seemed capable of handling the immense string sizes that I am working with (each string can be up to a few million characters long).

If possible, I would prefer to use the Trie because I have used it on this sized dataset before with success. Is there a way to search for substrings that are not whole matches or prefixes?

If not, what is a suitable suffix tree library that can handle very large strings?

I am using a linux 24 core 128 GB RAM machine.

Thank you

Assembly • 2.4k views
ADD COMMENT
2
Entering edit mode
8.9 years ago
Rayan Chikhi ★ 1.5k

You could start by looking at suffix arrays.. I'd recommend you take a look at wikipedia or one of the classical books for string algorithms or bioinformatics ("Algorithms on Strings, Trees and Sequences" by Gusfield, or, http://genome-scale.info is another recent one)

ADD COMMENT
0
Entering edit mode

Gusfield's book discusses how to convert a suffix tree into a suffix array, and then search for pattern P in suffix array T in O(n + log m) time. See pgs. 149-155.

ADD REPLY
0
Entering edit mode

Are non compressed suffix arrays sufficiently smaller in size so as to allow genome scale indexing?

ADD REPLY
0
Entering edit mode

A suffix array is just one integer (in a naive implementation, 8 bytes) per genome position. You mentioned to have 128 GB of RAM, so, a human genome should fit.

ADD REPLY

Login before adding your answer.

Traffic: 2620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6