Question: How To Produce Simulated 'Synthetic' Sequences
Stefano Berri (Cambridge, UK) wrote, 8.8 years ago:


I would like to find a program that produces simulated (Illumina GAII) reads starting from a FASTA file of the genome and a series of parameters, so that I can then test a method I am developing. The features I need (in descending order of importance) are:

  • Uses a fasta file as input (or something easily produced from a fasta file)
  • Outputs random reads (or simulates as closely as possible any known bias of the GAII)
  • Is written in (Bio)Perl, Python, ISO C/C++ or fully open source platforms.
  • Produces errors (like GAII) at known, ideally tunable, rate
  • Allows me to specify depth of coverage and length of reads
  • Produces qseq or similar as output
  • Allows me to produce paired-end reads

Methods papers often include some analysis of 'synthetic' or simulated data, but they usually don't bother to publish the program used to produce such data. I guess I could write a "quick and dirty" program to do it, but I'd rather not reinvent the wheel if something is already available.

Brad Chapman (Boston, MA) wrote, 8.8 years ago:

In addition to MetaSim, samtools has wgsim, located in its misc directory.

It's open source C, and has tunable parameters for error rates, read pair distribution and number of reads generated.
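Since wgsim asks for a number of read pairs rather than a depth of coverage, a small helper can do the conversion. The following is only a sketch: the function name `wgsim_command` and its defaults are made up for illustration, and the flags used (`-e`, `-d`, `-s`, `-N`, `-1`, `-2`, `-S`) should be checked against the usage message of your wgsim version.

```python
import math

def wgsim_command(ref_fasta, out1, out2, genome_length, coverage,
                  read_length=36, error_rate=0.02, insert_size=200,
                  insert_sd=20, seed=None):
    """Build an argument list for wgsim that approximates a target coverage.

    Hypothetical helper: defaults are illustrative, not recommendations.
    """
    # wgsim takes a number of read pairs (-N), not a coverage, so derive it:
    # coverage = pairs * 2 * read_length / genome_length
    n_pairs = math.ceil(coverage * genome_length / (2 * read_length))
    cmd = [
        "wgsim",
        "-e", str(error_rate),   # per-base sequencing error rate
        "-d", str(insert_size),  # outer distance between the two read ends
        "-s", str(insert_sd),    # standard deviation of that distance
        "-N", str(n_pairs),      # number of read pairs
        "-1", str(read_length),  # length of the first read
        "-2", str(read_length),  # length of the second read
    ]
    if seed is not None:
        cmd += ["-S", str(seed)]  # random seed, for reproducible runs
    # Positional arguments: reference FASTA, then the two output FASTQ files
    cmd += [ref_fasta, out1, out2]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` if wgsim is on your PATH. The output is FASTQ rather than qseq.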


Thanks. That is just what I was looking for, and I already had it on my computer!

Reply written 8.8 years ago by Stefano Berri
Michael Dondrup (Bergen, Norway) wrote, 8.8 years ago:

Try MetaSim. It addresses most of your requirements (except that it's written in Java, which is maybe a plus).

Jonathan Manning (near Boston, MA) wrote, 8.8 years ago:

The author of BFAST has a read simulation program in the dnaa toolkit. It has a few enhancements over wgsim from samtools.

Daniel Swan (Aberdeen, UK) wrote, 8.8 years ago:

You might want to look at shuffleseq in EMBOSS.

This will allow you to specify an input FASTA file and will at least maintain nucleotide composition. You can then shred the result into simulated 'reads', which should be relatively straightforward.
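To make the shuffle-then-shred idea concrete, here is a rough Python sketch. It is not the EMBOSS implementation; the function names `shuffle_sequence` and `shred` are invented for illustration, and the shredding simply takes fixed-length pieces at a fixed step.

```python
import random

def shuffle_sequence(seq, rng):
    """Return a random permutation of seq, preserving base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def shred(seq, read_length, step):
    """Cut seq into overlapping fixed-length pieces, one every `step` bases."""
    return [seq[i:i + read_length]
            for i in range(0, len(seq) - read_length + 1, step)]

# Example: shuffle a toy sequence, then shred it into 36 bp pieces
rng = random.Random(42)              # fixed seed so the run is repeatable
shuffled = shuffle_sequence("ACGT" * 25, rng)
pieces = shred(shuffled, read_length=36, step=8)
```

Note that shuffling destroys the original sequence order, so the pieces share only the composition of the input, not its content.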

You might also be able to use msbar to introduce mutations, but whether they're platform appropriate or not is another question.


Why can't you just use an actual data set? Looking at the ABySS paper, they used experimental data:

sequence data for the genome of an African male individual (HapMap DNA identifier NA18507) (International HapMap Consortium 2003, 2007) from the NCBI short read archive (accession no. SRA000271). The sequence was generated by Illumina, Inc. using their Genome Analyzer platform (Bentley et al. 2008).

as well as synthetic data generated like this:

The first synthetic data set represented all possible error-free 36-mer paired sequences, using a fixed fragment size of 200 bp. We generated simulated reads by sliding a 200 bp window, with a step size of 1 bp, along each chromosome of the reference genome and reporting the first 36 bp and the reverse complement of the last 36 bp. This process produced a data set of perfectly tiled 72-fold read coverage of the reference genome.
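The tiling procedure in that quote is easy to sketch in Python. This is my reading of the description, not the authors' code; `tiled_pairs` and `reverse_complement` are names chosen here for illustration.

```python
# Translation table for complementing DNA bases (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def tiled_pairs(chrom, fragment_size=200, read_length=36, step=1):
    """Yield (read1, read2) for every window of fragment_size along chrom."""
    for start in range(0, len(chrom) - fragment_size + 1, step):
        fragment = chrom[start:start + fragment_size]
        read1 = fragment[:read_length]                        # first 36 bp
        read2 = reverse_complement(fragment[-read_length:])   # revcomp of last 36 bp
        yield read1, read2
```

With a step of 1 bp, a base far enough from the chromosome ends falls inside the first read of 36 windows and the second read of another 36 windows, which is where the 72-fold coverage comes from.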

Modified 7 months ago by RamRS; written 8.8 years ago by Daniel Swan

I am afraid this is not what I am looking for. I need something that, given a series of FASTA sequences (the human chromosomes), produces short fragments (36-72 bp long) as if it were "sequencing" the genome.

Reply written 8.8 years ago by Stefano Berri

In that case you don't want it to output 'random' reads :)

Reply written 8.8 years ago by Daniel Swan

@Daniel. I can't use real data because I cannot change the underlying genome. I want to make, for instance, a duplication of a chromosome arm, then simulate "random reads" from that genome (with associated noise), and then use the program I am developing to see whether I can pick up the duplication and at what resolution. Taking a read pair from every 200 bp window is no good, because then the problem is trivial (and not realistic).
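A minimal sketch of that workflow in Python, under strong simplifying assumptions: the duplication is inserted in tandem, read start positions are uniform, only the forward strand is sampled, errors are substitutions only, and no quality scores or GAII-specific biases are modelled. The names `duplicate_segment` and `simulate_reads` are hypothetical.

```python
import random

def duplicate_segment(seq, start, end):
    """Return seq with seq[start:end] repeated in tandem right after itself."""
    return seq[:end] + seq[start:end] + seq[end:]

def simulate_reads(seq, read_length, coverage, error_rate, rng):
    """Sample reads uniformly at random and add substitution errors.

    Returns a list of (start_position, read_sequence) tuples.
    """
    # Number of reads needed to reach the requested average coverage
    n_reads = int(coverage * len(seq) / read_length)
    reads = []
    for _ in range(n_reads):
        pos = rng.randrange(len(seq) - read_length + 1)
        read = list(seq[pos:pos + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitute with one of the other three bases
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((pos, "".join(read)))
    return reads

# Example: toy 1 kb genome with a 200 bp tandem duplication, 5x coverage
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
altered = duplicate_segment(genome, 200, 400)
reads = simulate_reads(altered, read_length=36, coverage=5,
                       error_rate=0.01, rng=rng)
```

Keeping the true start position alongside each read makes it possible to score a detection method against the known duplication boundaries.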

Reply written 8.8 years ago by Stefano Berri

I found this tool, which sounds similar to what you are looking for: bamsurgeon. I haven't tried it yet, though.

Reply written 6 months ago by steve