Question

How To Produce Simulated 'Synthetic' Sequences

24

14.3 years ago

Stefano Berri 4.4k

Hi.

I am looking for a program that produces simulated Illumina GAII reads from a FASTA file of the genome and a set of parameters, so that I can test a method I am developing. The features I want, in descending order of importance, are:

Uses a FASTA file as input (or something easily produced from a FASTA file)
Outputs random reads (or simulates, as far as possible, any known bias of the GAII)
Is written in (Bio)Perl, Python, ISO C/C++, or another fully open source platform
Produces errors (like the GAII) at a known, ideally tunable, rate
Allows me to specify depth of coverage and read length
Produces qseq or similar as output
Can produce paired-end reads

Methods papers often include some analysis of 'synthetic' or simulated data, but the authors usually don't publish the program used to produce it. I could write a "quick and dirty" program myself, but I'd rather not reinvent the wheel if something is already available.

next-gen sequencing simulation model • 12k views

ADD COMMENT • link updated 14.3 years ago by Jonathan Manning ▴ 630 • written 14.3 years ago by Stefano Berri 4.4k

score 11 · Answer 1 · 2010-06-23

11

14.3 years ago

Brad Chapman 9.7k

In addition to MetaSim, samtools has wgsim located in the misc directory:

http://samtools.sourceforge.net/

http://bioinformatics.bc.edu/chuanglab/wiki/index.php/How_to_use_SAMtools

It's open source C, and has tunable parameters for error rates, read pair distribution and number of reads generated.
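As a rough illustration of how those parameters fit together, the sketch below works out how many read pairs a target depth of coverage requires and assembles a wgsim command line from it. The helper names (`pairs_for_coverage`, `wgsim_command`), the default values, and the file names are made up for this example. The flags (`-N`, `-1`, `-2`, `-d`, `-s`, `-e`) are the ones wgsim lists in its usage message, but it is worth checking them against the version you have installed. Note that wgsim writes FASTQ rather than qseq.

```python
import math


def pairs_for_coverage(genome_size, coverage, read_len):
    """Read pairs needed so that 2 * read_len * pairs / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / (2 * read_len))


def wgsim_command(ref_fa, out1, out2, genome_size, coverage,
                  read_len=36, insert_size=200, insert_sd=20, error_rate=0.01):
    """Build an argv list for wgsim (hypothetical helper; it does not run anything)."""
    n_pairs = pairs_for_coverage(genome_size, coverage, read_len)
    return [
        "wgsim",
        "-N", str(n_pairs),      # number of read pairs
        "-1", str(read_len),     # length of the first read
        "-2", str(read_len),     # length of the second read
        "-d", str(insert_size),  # outer distance between the two ends
        "-s", str(insert_sd),    # standard deviation of that distance
        "-e", str(error_rate),   # base error rate
        ref_fa, out1, out2,      # reference FASTA, then the two FASTQ outputs
    ]


if __name__ == "__main__":
    # Placeholder file names and genome size, for illustration only.
    cmd = wgsim_command("genome.fa", "reads_1.fq", "reads_2.fq",
                        genome_size=3_000_000, coverage=10)
    print(" ".join(cmd))
```

The list can be passed to `subprocess.run` once the paths point at real files.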

ADD COMMENT • link 14.3 years ago by Brad Chapman 9.7k

0

Thanks. That is just what I was looking for, and I already had it on my computer!

ADD REPLY • link 14.3 years ago by Stefano Berri 4.4k

score 8 · Answer 2 · 2010-06-23

8

14.3 years ago

Michael 55k

Try MetaSim. It addresses most of your requirements (except that it's written in Java, which may be a plus).

ADD COMMENT • link 14.3 years ago by Michael 55k

score 7 · Answer 3 · 2010-06-23

7

14.3 years ago

Jonathan Manning ▴ 630

The author of BFAST has a read simulation program in the dnaa toolkit. It offers a few enhancements over wgsim from samtools.

ADD COMMENT • link 14.3 years ago by Jonathan Manning ▴ 630

score 2 · Answer 4 · 2010-06-23

2

14.3 years ago

User 59 13k

You might want to look at shuffleseq in EMBOSS.

This lets you specify an input FASTA file and at least preserves the nucleotide composition. Shredding the result into simulated 'reads' should then be relatively straightforward.
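To make the idea concrete, here is a plain Python stand-in for the shuffle-then-shred approach. It is not how shuffleseq itself is implemented, and the function names and the `seed` parameter are invented for this sketch. It only shows that shuffling the letters keeps the composition and that shredding is a simple slicing loop.

```python
import random


def shuffle_sequence(seq, seed=None):
    """Return seq with its letters in random order (composition is unchanged)."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def shred(seq, read_len, step):
    """Cut seq into fragments of read_len bases, starting one every step bases."""
    return [seq[i:i + read_len]
            for i in range(0, len(seq) - read_len + 1, step)]


if __name__ == "__main__":
    shuffled = shuffle_sequence("ACGTACGTAACCGGTT", seed=1)
    for fragment in shred(shuffled, read_len=8, step=4):
        print(fragment)
```

Because the shuffle destroys the original order, fragments produced this way share the genome's base composition but not its sequence.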

You might also be able to use msbar to introduce mutations, but whether they're platform appropriate or not is another question.

EDIT:

Why can't you just use an actual data set? Looking at the Abyss paper, they used experimental data:

sequence data for the genome of an African male individual (HapMap DNA identifier NA18507) (International HapMap Consortium 2003, 2007) from the NCBI short read archive (accession no. SRA000271). The sequence was generated by Illumina, Inc. using their Genome Analyzer platform (Bentley et al. 2008).

as well as synthetic data generated like this:

The first synthetic data set represented all possible error-free 36-mer paired sequences, using a fixed fragment size of 200 bp. We generated simulated reads by sliding a 200 bp window, with a step size of 1 bp, along each chromosome of the reference genome and reporting the first 36 bp and the reverse complement of the last 36 bp. This process produced a data set of perfectly tiled 72-fold read coverage of the reference genome.
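The tiling procedure described in that passage can be sketched in a few lines of Python. The function names are my own, and the defaults (200 bp fragment, 36 bp reads, 1 bp step) are taken from the quoted text. This is an illustration of the description, not the code used in the paper.

```python
# Translation table for complementing bases, keeping case and N as-is.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tiled_pairs(chrom, fragment_len=200, read_len=36, step=1):
    """Yield (read1, read2) for every fragment_len window along chrom.

    read1 is the first read_len bases of the window; read2 is the reverse
    complement of the last read_len bases. No errors are introduced.
    """
    for start in range(0, len(chrom) - fragment_len + 1, step):
        fragment = chrom[start:start + fragment_len]
        yield fragment[:read_len], reverse_complement(fragment[-read_len:])


if __name__ == "__main__":
    # Tiny made-up sequence with short windows, just to show the output shape.
    for read1, read2 in tiled_pairs("ACGTTGCAAC", fragment_len=6, read_len=2):
        print(read1, read2)
```

With a step of 1 bp, each window start contributes 2 x 36 = 72 read bases, which is where the 72-fold coverage in the quote comes from.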

ADD COMMENT • link updated 6.1 years ago by Ram 44k • written 14.3 years ago by User 59 13k

0

I am afraid this is not what I am looking for. I need something that takes a series of FASTA sequences (the human chromosomes) and produces short fragments (36-72 bp long), as if it were "sequencing" the genome.

ADD REPLY • link 14.3 years ago by Stefano Berri 4.4k

0

In that case you don't want it to output 'random' reads :)

ADD REPLY • link 14.3 years ago by User 59 13k

0

@Daniel. I can't use real data because I cannot change the underlying genome. I want to make, for instance, a duplication of a chromosome arm, then simulate "random reads" from that genome (with associated noise), and then use the program I am developing to see whether I can pick up the duplication and at what resolution. Taking every 200 bp window is no good, because then it is trivial (and not realistic).
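For what it's worth, the workflow described here (edit the genome, then sample noisy reads at random positions) looks roughly like the Python sketch below. Everything in it is an assumption made for illustration: the function names, the tandem placement of the duplicated copy, uniform sampling of start positions, and a flat substitution-only error rate. It is not a GAII error model and it ignores quality scores, indels, and coverage bias.

```python
import random


def duplicate_region(seq, start, end):
    """Return seq with seq[start:end] repeated right after itself (tandem duplication)."""
    return seq[:end] + seq[start:end] + seq[end:]


def simulate_reads(genome, n_reads, read_len, error_rate, seed=None):
    """Sample reads at uniform random positions and add substitution errors.

    Returns a list of (start, read) tuples. start is the position in the
    modified genome, kept so calls can be checked against the truth.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


if __name__ == "__main__":
    # Made-up toy genome; a real run would load chromosomes from FASTA.
    genome = "ACGTTGCAGGCTAAGCTTAGGCATCGAT" * 10
    modified = duplicate_region(genome, 50, 120)
    for start, read in simulate_reads(modified, n_reads=5, read_len=36,
                                      error_rate=0.01, seed=42):
        print(start, read)
```

Keeping the true start position with each read makes it easy to score how well a detection method recovers the duplicated interval.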

ADD REPLY • link 14.3 years ago by Stefano Berri 4.4k

0

I found this tool, which sounds similar to what you are looking for: bamsurgeon. I haven't tried it yet, though.

ADD REPLY • link 6.0 years ago by steve ★ 3.5k