Using Seqan To Build A Suffix Tree For Millions Of Reads
4
4
Entering edit mode
12.0 years ago
Weaklearner ▴ 40

I am trying to build a suffix tree for millions of reads with Seqan. Since the number of sequences is so big, I need some lazy implementation where only nodes in certain range of levels are retained. Is there any way to do this. Right now this is what I have and it ran out of memory at the construction step. It seems like the tree is not lazily evaluated at all.

int main (int argc, char const * argv[])
{
String<DnaQ> seq;

FragmentStore<> fragStore;
return 1;

for (int j = 0; j < 5; ++j) {
std::cout << seq << std::endl;

for (int i = 0; i < length(seq); ++i) {
std::cout << getQualityValue(seq[i]) << " ";
}
std::cout << std::endl;
}

typedef Index< TReadSeqStore, IndexWotd<TMyIndex> > TMyIndex;
}

next-gen sequencing assembly • 5.2k views
0
Entering edit mode

Can you provide a link to the library's source? The SeqAn website has a few different downloads, not sure exactly what you're using.

If they expose a stream interface perhaps you can inject your own layer of buffering. It sounds like your looking for a way to offload the virtual memory pressure. Having the flexibility to inject your own buffering strategy may provide your scenario with the best results.

2
Entering edit mode
11.6 years ago
lh3 33k

Google search of "lazy suffix tree" gave me a paper (PDF) 12 years ago. According to that paper, you still need 8.5 bytes per character in average. Although this is better than the implementation in MUMmer3 which according to its paper takes 12-17 byte, lazy suffix tree is still going to take 850MB per million 100bp reads. You will not be able to build the suffix tree for a whole run of reads unless you have a big memory machine.

If you want to use seqan, the better choice is Enhanced suffix array, which according to its paper takes 6.25 bytes per base for linear-time matching. This is better, though not much. It is also theoretically possible to build a suffix tree in NlogN+O(N) bits, similar to that of a suffix array (NlogN bits), using the idea of succinct data structure, but no practical implementation exists so far.

Nonetheless, to index NGS reads, the right strategy is to use a self-compressed data structure. Such a data structure may use less space than the data itself. Jared's new SGA, which is based on FM-Index, is one of the available implementations. This page presents a Javascript implementation of trie using the succinct representation of a tree. You can find a few C++ libraries with a google search "succinct trie". A trie in its simplest form supports fewer operations, but if you just want to find exact matches, it suffices.

Suffix tree is one of the earliest data structures for strings. It is in my opinion not the best for DNA data any more, largely due to its huge memory requirement. The recent advances in computer science usually result in an index of a much smaller size without a big compromise on speed.

0
Entering edit mode

Actually, typical SA and ESA only index one string as well. Seqan implements the canonical version, so it won't work with multiple reads. As of now, another choice of index reads with FM-index is my fermi assembler. Given high coverage, the size of indices generated by fermi/SGA can be less than half of input bases.

0
Entering edit mode

I was wrong. Seqan implements the generalized suffix array. Its SA/ESA works for multiple reads. But for huge data sets, sga/fermi is still preferred as both probably use less memory. They are designed to handle billions of short reads.

2
Entering edit mode
10.1 years ago
dave.weese ▴ 30

Hi there,

the lazy suffix tree that you were using, consists of a generalized suffix array and a dynamic suffix tree data structure on top. On average the overall memory footprint will be at least as big a simple generalized suffix array and roughly half as big compared to the enhanced suffix array (=SA+LCP+ChildTab). As in the beginning the dynamic structure is empty, it seems that even the sole suffix array doesn't fit into main memory. How much reads are we talking about and how much RAM do you have?

There are several means to reduce the internal memory consumption. First of all, what does your algorithm look like and how do you traverse the tree?

1) In many situations where you traverse nodes below a certain string depth, it is beneficial to use an external k-mer index first and then traverse the lazy suffix tree consisting only of suffixes beginning with the same k-mer. Then you can also use an external/mmap'ed string for the whole suffix array as these suffixes form a contiguous subinterval. The bigger k, the smaller this subinterval and the smaller you internal memory requirements.

2) You can reduce the suffix array size by overloading the type used for representing a suffix position (SAValue), e.g. to use unsigned char to represent the position within the read.

3) Work on batches of reads, as Shigeta proposed.

Just explain a bit more about what you are going to do with the suffix tree and then we'll see what option fits best.

Cheers, David

1
Entering edit mode
10.1 years ago
Manuel ▴ 400

The SeqAn library now also features a FM Index which allows for smaller index sizes at the trade-off of some performance. You will need the current trunk version at the moment and the index will be in the upcoming 1.4 release.

The Mini Bowtie Tutorial gives some examples. There also is API documentation for the FM Index.

HTH

0
Entering edit mode

While it is a great news that seqan now supports FM-index, I doubt it helps to solve OP's question. A typical FM-index is only able to encode one string, not multiple strings. From the brief documentation, it seems to me that seqan implements the typical version. If that is true, it won't be able to index reads. FM-index for multiple strings needs some tweaks. For bioinformatic applications, the only ones so far are in SGA and my fermi.

1
Entering edit mode

lh3: Thank you for this comment, the documentation must be misleading. The "text" can actually be either a string or a string set so indexing many strings at the same time is possible.

0
Entering edit mode

+1. Thanks for the clarification. I find the example now. In implementation, does seqan simply concatenate them or support multiple sentinels? In the latter case, is that right to say all seqan indices including SA and ESA all support multiple strings?

1
Entering edit mode

All indices in SeqAn support multiple strings, if they don't then this would be a bug.

I'll get back to you later with the details on the FM index later and hopefully improved documentation for the FM Index.

1
Entering edit mode

Thanks! BTW, the question of the support of multiple sentinels is really about SA (because ESA/FM-index is derived from SA): how do you construct SA for multiple strings? There are three possible ways: 1) concatenate all the strings without a separator (the bwa/bowtie/soap2/bwt-sw way); 2) concatenate by adding a NULL character to the end of each string and treat the NULL just as another ordinary character; 3) add a NULL/sentinel to the end of each string, but the lexicographical order of these sentinels is solely determined by the order of the strings (the rlcsa/SGA/fermi/BCR way). PS: hmm.. probably I should try by myself, too..

1
Entering edit mode

In fact, we don't add anything to end of the strings as we don't want to put restrictions on the string alphabet. Instead we adapted all of our suffix array algorithms to construct the generalized suffix array (contains pairs < seqNo, seqOffset >) and sort the suffixes as if there would be an appended sentinel less than all string characters and unique for each string. The holds for all SeqAn indices, even for the q-gram and FM index

0
Entering edit mode

Thanks. I know for multiple strings, keeping an SA value as an integer pair is more often seen in the literature for 3). How many bytes does each value take for indexing the human genome? Pair<uint8_t,uint32_t>? 5 bytes in raw and 8 bytes after memory alignment? I prefer the concatenation description because I can use one 32-bit integer to keep a SA value. A drawback is a position does not tell you the chr no, but usually we only retrieve chr no after we get the final alignment.

0
Entering edit mode

Yes, for the human genome you can use the fact that in total there are < 2**32 bases. SeqAn also supports the kind of positions you mentioned (we call them global positions, whereas the pair is a local position) simply by defining SAValue to be a uint32_t. However, then getSeqNo and getSeqOffset is more costly. Instead of a Pair<uint8_t,uint32_t>, I would use a packed Pair (Pair<uint8_t,uint32_t, Pack>) where memory alignment is disabled and the total consumption is 5n bytes.

PS: For the rare case, where you know in advance the maximum number of strings and their maximal lengths, the Bit Packed Pair allows to specify the number of bits used for each type.

0
Entering edit mode

When I use createSuffixArray(sa, text, algo), if the text is a StringSet, it seems that sa has to be String<Pair<unsigned,unsigned> >; otherwise I got a compiling error. How can I get a uint32_t suffix array? Does createSuffixArray() support that? If not, maybe in index, you first create Pair<uint8_t,uint32_t> and then convert it to an uint32_t array? Also, I guess Pair<uint8_t,uint32_t,Pack> is slower than Pair<uint8_t,uint32_t>? Do you know how much slower? I have not really evaluated such things before. Thanks!

0
Entering edit mode

Hi lh3,

you are right, it is possibility 3. We will update the online documentation soon.

0
Entering edit mode
11.7 years ago
Shigeta ▴ 460

The suffix tree must store the entire length of the sequence. It should help if you cut the sequence off at a certain max length, but if you have a few GB of reads, its probably going to store the entire tree in RAM and run out of space.

Not really familiar with SeqAn; I was just writing an ST in python and got interested in your question.