Question: What NGS Read Simulators Are Available For Paired-End Data?
17
7.9 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

Hi all, I need to create simulated paired-end sequence data with fixed read-lengths on each end (e.g., 75mers on each end of a 500bp DNA fragment, a la Illumina). Does anyone know of a reliable simulator that can generate paired-end sequences to a requested depth, with a requested insert size/variance and error rate, for a requested genome in a FASTA file? The output would preferably be two FASTQ files, one for each end.

I can write my own, but do not want to re-invent this boring (though useful) wheel. Any clues?

modified 3.0 years ago by Brian Bushnell15k • written 7.9 years ago by Aaronquinlan10k

See also the following thread discussing read simulation with quality scores: http://bit.ly/kNePbA

written 7.2 years ago by Botond Sipos1.6k
22
7.9 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

samtools wgsim does most of what you request:

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -c            generate reads in color space (SOLiD reads)
         -C            show mismatch info in comment rather than read name
         -h            haplotype mode

Note: For SOLiD reads, the first read is F3 and the second is R3.
written 7.9 years ago by iw9oel_ad6.0k

Perfect. I hadn't looked in the misc/ directory in a while, and I never saw documentation for this. Thanks Keith!

written 7.9 years ago by Aaronquinlan10k
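
As a rough sketch, the original request (75 bp on each end of a roughly 500 bp fragment) maps onto the options listed above as follows. The file names ref.fa, sim_1.fq and sim_2.fq are placeholders, and the numeric values are simply the defaults from the usage text with the read lengths changed.

```shell
#!/bin/sh
# Placeholder inputs; adjust to your own data.
ref=ref.fa        # reference genome in FASTA format (hypothetical name)
read_len=75       # length of each end (-1 and -2)
frag_len=500      # outer distance between the two ends (-d)
frag_sd=50        # standard deviation of that distance (-s)
n_pairs=1000000   # number of read pairs (-N)

# Collect the arguments once so they can be printed and then reused.
set -- -e 0.02 -d "$frag_len" -s "$frag_sd" -N "$n_pairs" \
       -1 "$read_len" -2 "$read_len" "$ref" sim_1.fq sim_2.fq
echo "wgsim $*"

# Run only if wgsim is installed and the reference file exists.
if command -v wgsim >/dev/null 2>&1 && [ -f "$ref" ]; then
    wgsim "$@"
fi
```

The two output files hold the first and second end of each pair, which matches the "two FASTQ files" format asked for in the question.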
7
7.9 years ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

MetaSim may be a good option. It has platform-specific error modeling, which makes it suited to generating realistic input data rather than "perfectly" random reads.

written 7.9 years ago by Istvan Albert ♦♦ 77k

Another solid choice, thank you.

written 7.9 years ago by Aaronquinlan10k
4
7.7 years ago by
Jorjial250
Valencia, Spain
Jorjial250 wrote:

You can also try dwgsim, a fork of SAMtools wgsim written by Nils Homer.

Usage:   dwgsim [options] <in.ref.fa> <out.bwa.read1.fq> <out.bwa.read2.fq> <out.bfast.fq>

Options: -e FLOAT      base error rate [0.020]
         -E FILE       base/color error rate file
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -n INT        maximum number of Ns allowed in a given read[0]
         -c            generate reads in color space (SOLiD reads)
         -h            haplotype mode
written 7.7 years ago by Jorjial250
1
7.7 years ago by
Ketil3.9k
Germany
Ketil3.9k wrote:

Note the difference between Illumina's paired ends (just reading from each end of a clone), and circularized clones (mate pairs), which give longer inserts, but different directions - and probably more artifacts like chimerae.

(BTW, I've written a simulator for 454 data (flowsim); feel free to contact me if you're interested in seeing this extended to paired-end, or rather mate-pair, sequences.)

written 7.7 years ago by Ketil3.9k
1
3.0 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

http://www.ncbi.nlm.nih.gov/pubmed/22508794

pIRS: Profile-based Illumina pair-end reads simulator.

Or ART:

http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

Or simNGS:

http://www.ebi.ac.uk/goldman-srv/simNGS/

There's more on this OMICtools page: http://omictools.com/read-simulators-c1444-p1.html

modified 3.0 years ago • written 3.0 years ago by Daniel Swan13k
0
3.1 years ago by
sacha1.3k
France
sacha1.3k wrote:

I don't understand how you set the depth with wgsim?

written 3.1 years ago by sacha1.3k

via read length, number of reads and the length of the input sequence?

written 3.1 years ago by Aerval270
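
To make that relationship concrete: mean depth is roughly pairs × (len1 + len2) / genome length, so the number of pairs for a target depth can be solved for and passed to wgsim as -N. The sketch below is only an illustration; the function names fasta_length and pairs_for_depth are made up for this example and are not part of wgsim.

```shell
#!/bin/sh
# Total number of bases in a FASTA file (header lines starting with '>' are skipped).
fasta_length() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Read pairs needed for a target mean depth, from
#   depth = pairs * (len1 + len2) / genome_length
# solved for pairs and rounded up using integer arithmetic.
# Arguments: depth genome_length len1 len2
pairs_for_depth() {
    depth=$1; genome_len=$2; len1=$3; len2=$4
    echo $(( (depth * genome_len + len1 + len2 - 1) / (len1 + len2) ))
}

# Example: 30x over a 1 Mbp reference with 2 x 75 bp reads.
pairs_for_depth 30 1000000 75 75    # prints 200000
```

With a real reference, something like `wgsim -N "$(pairs_for_depth 30 "$(fasta_length ref.fa)" 75 75)" -1 75 -2 75 ref.fa sim_1.fq sim_2.fq` would then request about 30x, where ref.fa and the output names are placeholders. This ignores stretches of N in the reference, so the realised depth over callable sequence may differ somewhat.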
0
3.0 years ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

RandomReads, in the BBMap package, supports paired-ends.  For example:

randomreads.sh ref=ref.fa out=reads.fq paired interleaved reads=100k length=150 mininsert=200 maxinsert=400 gaussian

 

written 3.0 years ago by Brian Bushnell15k
3

I have started to have the feeling that everything is implemented in the BBMap package :-) 

written 3.0 years ago by Istvan Albert ♦♦ 77k

That's my ultimate goal...  haven't quite reached it yet!

written 3.0 years ago by Brian Bushnell15k

Hi Brian! Is it possible to generate reads in specific intervals? WES-like read simulation?

written 13 months ago by user230613260

No, unfortunately not. You'd have to use something like bedtools to pull out the exome fasta using the genome fasta and the bait coordinates, and then use RandomReads on the result. I don't currently have anything to parse bed, but that does seem like a good addition to RandomReads.

written 13 months ago by Brian Bushnell15k
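
The workaround described above might look like the following sketch. It assumes bedtools is installed, and genome.fa, baits.bed, exome.fa and wes_reads.fq are placeholder names; the randomreads.sh parameters are copied unchanged from the example earlier in this thread.

```shell
#!/bin/sh
genome=genome.fa     # full genome FASTA (hypothetical name)
baits=baits.bed      # bait/target coordinates in BED format (hypothetical name)
exome_fa=exome.fa    # intermediate FASTA holding only the target intervals

if command -v bedtools >/dev/null 2>&1 \
   && command -v randomreads.sh >/dev/null 2>&1 \
   && [ -f "$genome" ] && [ -f "$baits" ]; then
    # Step 1: extract one FASTA record per BED interval.
    bedtools getfasta -fi "$genome" -bed "$baits" -fo "$exome_fa"

    # Step 2: simulate paired reads from the extracted intervals only.
    randomreads.sh ref="$exome_fa" out=wes_reads.fq paired interleaved \
        reads=100k length=150 mininsert=200 maxinsert=400 gaussian
else
    echo "bedtools, randomreads.sh, or the input files are missing; nothing was run" >&2
fi
```

Because each interval becomes its own reference record, the simulated reads are positioned relative to the intervals rather than to the original chromosomes. Intervals shorter than the requested insert size may not yield pairs, so padding the baits beforehand could be necessary; that part is an untested assumption.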

Thank you for the fast answer. I'll try the bedtools pre-step. Another issue: I've realised that in PE mode the names of the output reads in the two files are not paired. Is there an option for this?

written 13 months ago by user230613260
1

Yes - add the flag "illuminanames".

written 13 months ago by Brian Bushnell15k

Is it possible to generate RNA-seq reads with BBMap?

written 21 days ago by k.kathirvel9350