Hi, I am giving a workshop on genome assembly and I would like the students to try genome assembly for themselves. However, it will not be feasible to have tens of students assembling a genome on the order of megabases, because the work will likely run on a single server or on desktop computers, and there will be a time constraint. Is there a way to simulate an SFF for something smaller, like a plasmid? Or to simulate an SFF based on a neighborhood of a few operons? Thank you.
Rather than simulating an SFF (assuming you mean 454's Standard Flowgram Format), you might be better off simulating the sequences directly. There are some answers on that topic here: how-to-produce-simulated-synthetic-sequences
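To make the "simulate sequences" idea concrete, here is a minimal Python sketch that samples reads from a small reference (a plasmid, say) at a chosen coverage and writes them as FASTA. The function names, the default read length, and the error rate are all my own assumptions for illustration. It also only models substitution errors, whereas real 454 reads are dominated by homopolymer indels, so treat it as a toy for a workshop and not as a faithful 454 simulator.

```python
import random


def simulate_reads(reference, coverage=20, mean_len=400, sd_len=50,
                   error_rate=0.01, seed=42):
    """Sample reads uniformly from `reference` with substitution errors only.

    All defaults are illustrative assumptions, not values from any real run.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    # Number of reads needed so that reads * mean_len ~= coverage * genome size.
    n_reads = int(coverage * len(reference) / mean_len)
    reads = []
    for _ in range(n_reads):
        # Draw a read length, keep it at least 50 and no longer than the reference.
        length = min(len(reference), max(50, int(rng.gauss(mean_len, sd_len))))
        start = rng.randint(0, len(reference) - length)
        fragment = list(reference[start:start + length])
        for i, base in enumerate(fragment):
            if rng.random() < error_rate:
                # Replace the base with a different one (substitution error).
                fragment[i] = rng.choice([b for b in bases if b != base])
        reads.append("".join(fragment))
    return reads


def write_fasta(reads, path):
    """Write reads to `path` as FASTA with hypothetical names read_1, read_2, ..."""
    with open(path, "w") as handle:
        for i, seq in enumerate(reads, 1):
            handle.write(f">read_{i}\n{seq}\n")


if __name__ == "__main__":
    # A random 5 kb toy "plasmid"; swap in a real sequence for the workshop.
    rng = random.Random(0)
    toy_reference = "".join(rng.choice("ACGT") for _ in range(5000))
    write_fasta(simulate_reads(toy_reference), "simulated_reads.fasta")
```

At 20x coverage of a 5 kb reference this gives a few hundred reads, which should assemble in seconds even on a desktop.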
Have you tried Google? You will find at least this one:
Flowsim, http://blog.malde.org/index.php/flowsim/, paper here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935434/
Maybe you could use real data from a trace archive such as the SRA database (let's say a virus, like this one)? You can download FASTQ files (not SFFs), but as far as I know Newbler can read FASTA files with or without quality information (although it's possible that you would need to rescale the quality scores in the first case).
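If you go the FASTQ route, the conversion to FASTA plus a separate quality file can be sketched in a few lines of Python. This assumes plain four-line FASTQ records and Phred+33 quality encoding; the function name and the output file names are hypothetical, and whether the scores need further rescaling for Newbler is something I have not verified.

```python
def fastq_to_fasta_qual(fastq_path, fasta_path, qual_path, offset=33):
    """Split a four-line-per-record FASTQ into FASTA and a numeric .qual file.

    Assumes Phred+`offset` encoding; returns the number of records written.
    """
    count = 0
    with open(fastq_path) as fq, \
            open(fasta_path, "w") as fa, \
            open(qual_path, "w") as ql:
        while True:
            header = fq.readline().rstrip()
            if not header:
                break  # end of file (or a blank line)
            seq = fq.readline().rstrip()
            fq.readline()  # '+' separator line, ignored
            qual = fq.readline().rstrip()
            name = header[1:]  # drop the leading '@'
            # Decode each quality character into its integer Phred score.
            scores = " ".join(str(ord(c) - offset) for c in qual)
            fa.write(f">{name}\n{seq}\n")
            ql.write(f">{name}\n{scores}\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names; replace with the files you downloaded.
    n = fastq_to_fasta_qual("reads.fastq", "reads.fna", "reads.qual")
    print(f"converted {n} reads")
```

If the scores do need rescaling, the place to do it is where the integer score is computed from each quality character.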
The new NCBI SRA format allows you to download their SRA archives and convert them to any of the more widely used vendor formats (SFF, FASTQ, Illumina) via their SRA Toolkit; see http://www.ncbi.nlm.nih.gov/books/NBK49294/ for the download and manual.
So, search for "virus" or "plasmid" in the SRA (perhaps something like http://www.ncbi.nlm.nih.gov/sra/SRX025865?report=full), download the corresponding SRA, convert it to SFF and you're done.
Note 1: the 1.0b10 toolkit has one "error" flagged by current gcc, which is quickly fixed. Note 2: using plasmid or virus libraries as examples for assembly may be counterproductive. These tend to be really nasty because, most of the time, what was sequenced is not one clean DNA but a mixture, and that can confuse assemblers quite a lot.