We have been trying to understand and develop tutorials for a number of genome assembly algorithms. One difficulty we are having is that even the 'small' data sets used for testing (E. coli or other bacteria) are over 12 GB. That is not too large for big computers, but it is quite inconvenient when you just want to play around with a toy model.
I created a tiny data set of 8000 x 2 reads covering 1000 nt of the E. coli genome (http://www.homolog.us/blogs/blog/2013/08/12/hydrogen-atom-to-learn-inner-workings-of-genome-assembly-programs/) for our tutorials, but I am wondering whether such a test case is useful, or whether I should switch to something that others already use. What do you use for teaching NGS assembly algorithms in class? If you can suggest a better alternative already in use, that will stop me from reinventing the wheel.
Here is the linked blog post to save you a click's effort :)
Those who try to understand various algorithms (say number sorting or graph traversal) usually play with a small set and try the algorithm 'on paper' before implementing it as a program. Learning the inner workings of genome assembly algorithms in that way poses some difficulties, because popular genome assembly programs spend quite a bit of time taking care of noise in the data (tip removal, bubble popping, etc.). So, the test data set not only needs to mimic the genome sequence but also needs to include enough noise to look like a real NGS library. Such a test case is hard to create, and the best solution is to take an actual NGS library of some species.
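To make the 'noise' point concrete, here is a minimal Python sketch. It is not taken from any assembler, and the sequence and reads are made up for illustration. It counts k-mers in a handful of identical reads plus one read carrying a single substitution. The erroneous read contributes k-mers seen only once, and those rare k-mers are what end up as tips (error near a read end) or bubbles (error in the middle) in the assembly graph.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical 16 nt 'genome', sequenced five times without error
genome = "ACGTTGCATGGCTAAC"
reads = [genome] * 5
# One extra read with a single substitution (T -> A at index 8)
reads.append("ACGTTGCAAGGCTAAC")

k = 4
counts = kmer_counts(reads, k)

# k-mers seen only once are candidates for sequencing errors
weak = sorted(kmer for kmer, c in counts.items() if c == 1)
print(weak)  # ['AAGG', 'AGGC', 'CAAG', 'GCAA']
```

A single substitution away from the read ends yields exactly k such k-mers here, which is why even a small amount of real noise changes the graph noticeably.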
The SPAdes authors made extensive use of two E. coli libraries, which we think are possibly the best realistic test cases. Other researchers use B. cereus and a few other bacterial libraries for testing their programs. Those data sets are excellent for beta-testing an actual assembler, but may not be good for 'pencil and paper' analysis of the algorithms.
We created a small 'library' with ~8,000 paired-end reads, covering a 1000 nt region of the E. coli genome. When we try to assemble with a small k-mer size (11, 13, 15), the reads organize into 706 nt and 218 nt regions, which then need to be scaffolded. So, the data set provides opportunities to learn about a number of assembly-related concepts. So far it has helped us quite a bit in figuring out the algorithms of various programs. For example, the data set was immensely helpful when we worked on the rectangular graph algorithm. It is from a real measurement and therefore has all the 'noise' that you would expect in an NGS library. We would like to integrate it into our tutorials on genome assembly programs, but if you have a different suggestion, please let us know. Email us if you would like to use the test data set for a class or other purpose, and we will send you the files (only 200K compressed).
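For readers who want to see how the choice of k can split an assembly into separate pieces, here is a rough 'pencil and paper' sketch in Python. It is only an illustration under simplifying assumptions: the 19 nt sequence is invented (it is not our E. coli data), the reads are error-free, reverse complements are ignored, and the function names are our own. It builds a de Bruijn graph and spells out the non-branching paths as contigs.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs(graph):
    """Spell maximal non-branching paths (isolated cycles are skipped)."""
    indeg = defaultdict(int)
    for nexts in graph.values():
        for nxt in nexts:
            indeg[nxt] += 1
    nodes = set(graph) | set(indeg)

    def one_in_one_out(n):
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    result = []
    for node in sorted(nodes):
        if one_in_one_out(node):
            continue  # interior of a path, not a starting point
        for nxt in sorted(graph.get(node, ())):
            path = node + nxt[-1]
            while one_in_one_out(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            result.append(path)
    return sorted(result)

# Invented toy sequence containing the 5 nt repeat TTGCA twice
toy = "AAGCTTGCACCGTTGCATT"
# Error-free 10 nt reads, one starting at every position
reads = [toy[i:i + 10] for i in range(len(toy) - 10 + 1)]

small_k = contigs(de_bruijn(reads, 4))
large_k = contigs(de_bruijn(reads, 7))
print(len(small_k), small_k)
print(len(large_k), large_k)
```

With k = 4 the repeat is longer than the graph nodes, so the graph branches and the sketch returns four short contigs; with k = 7 the nodes span the repeat and the whole toy sequence comes back as one contig. Real assemblers add error correction and graph simplification on top of this.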