Question: What NGS Read Simulators Are Available For Paired-End Data?
9.3 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

Hi all, I need to create simulated paired-end sequence data with fixed read-lengths on each end (e.g., 75mers on each end of a 500bp DNA fragment, a la Illumina). Does anyone know of a reliable simulator that can generate paired-end sequences to a requested depth, with a requested insert size/variance and error rate, for a requested genome in a FASTA file? The output would preferably be two FASTQ files, one for each end.

I can write my own, but do not want to re-invent this boring (though useful) wheel. Any clues?

ADD COMMENTlink modified 4.3 years ago by Brian Bushnell16k • written 9.3 years ago by Aaronquinlan11k

See also the following thread discussing read simulation with quality scores: http://bit.ly/kNePbA

ADD REPLYlink written 8.5 years ago by Botond Sipos1.7k
9.3 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

samtools wgsim does most of what you request:

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -c            generate reads in color space (SOLiD reads)
         -C            show mismatch info in comment rather than read name
         -h            haplotype mode

Note: For SOLiD reads, the first read is F3 and the second is R3.
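
As a rough sketch of how the request in the question (75-mers on each end of a ~500 bp fragment) maps onto the flags listed above, the following builds a wgsim command line. The file names `ref.fa`, `reads_1.fq` and `reads_2.fq` and the helper `wgsim_command` are hypothetical; only the flags shown in the usage text are used, and the command is run only if a `wgsim` binary and the reference file happen to be present.

```python
import os
import shutil
import subprocess

def wgsim_command(ref_fa, out1, out2, n_pairs, read_len=75,
                  outer_dist=500, sd=50, error_rate=0.02):
    """Build a wgsim argument list using only flags from the usage text."""
    return [
        "wgsim",
        "-e", str(error_rate),   # base error rate
        "-d", str(outer_dist),   # outer distance between the two ends
        "-s", str(sd),           # standard deviation of that distance
        "-N", str(n_pairs),      # number of read pairs
        "-1", str(read_len),     # length of the first read
        "-2", str(read_len),     # length of the second read
        ref_fa, out1, out2,
    ]

# Hypothetical file names and pair count, for illustration only.
cmd = wgsim_command("ref.fa", "reads_1.fq", "reads_2.fq", n_pairs=1_000_000)
print(" ".join(cmd))

# Run only when wgsim is installed and the reference actually exists.
if shutil.which("wgsim") and os.path.exists("ref.fa"):
    subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if the paths contain spaces.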
ADD COMMENTlink modified 11 weeks ago by RamRS24k • written 9.3 years ago by iw9oel_ad6.0k

Perfect. I hadn't looked in the misc/ directory in a while, and I never saw documentation for this. Thanks Keith!

ADD REPLYlink written 9.3 years ago by Aaronquinlan11k
9.3 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

MetaSim may be a good option. It has platform-specific error modeling, which makes it suited to generating realistic input data rather than "perfectly" random reads.

ADD COMMENTlink written 9.3 years ago by Istvan Albert ♦♦ 81k

Another solid choice, thank you.

ADD REPLYlink written 9.3 years ago by Aaronquinlan11k
9.1 years ago by
Jorjial280
Valencia, Spain
Jorjial280 wrote:

You can also try dwgsim. This is a fork of the SAMtools wgsim and its creator is Nils Homer.

Usage:   dwgsim [options] <in.ref.fa> <out.bwa.read1.fq> <out.bwa.read2.fq> <out.bfast.fq>

Options: -e FLOAT      base error rate [0.020]
         -E FILE       base/color error rate file
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -n INT        maximum number of Ns allowed in a given read[0]
         -c            generate reads in color space (SOLiD reads)
         -h            haplotype mode
ADD COMMENTlink modified 11 weeks ago by RamRS24k • written 9.1 years ago by Jorjial280

In my experience dwgsim is much better than its predecessor wgsim. It has some nice features and seems to be maintained, whereas wgsim's last commit was years ago.

ADD REPLYlink written 4 weeks ago by pom0
4.3 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

pIRS: Profile-based Illumina pair-end reads simulator.

Or ART

Or simNGS

There's more on this OmicsTools page.

ADD COMMENTlink modified 11 weeks ago by RamRS24k • written 4.3 years ago by Daniel Swan13k
9.1 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

Note the difference between Illumina's paired ends (just reading from each end of a clone) and circularized clones (mate pairs), which give longer inserts but different read directions, and probably more artifacts such as chimeras.
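
To make the direction difference concrete, here is a minimal sketch of how the two reads could be cut from a single fragment under each scheme. It assumes the usual conventions (paired-end reads face inward, "FR"; mate-pair reads face outward, "RF" once mapped back to the genome); the function names are made up for illustration and no chimeras or other artifacts are modeled.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def paired_end(fragment, read_len):
    """Inward-facing pair: read 1 from the left end on the forward strand,
    read 2 from the right end on the reverse strand."""
    return fragment[:read_len], revcomp(fragment[-read_len:])

def mate_pair(fragment, read_len):
    """Outward-facing pair (assumed mate-pair convention): read 1 is the
    reverse complement of the left end, read 2 is the right end as-is."""
    return revcomp(fragment[:read_len]), fragment[-read_len:]

fragment = "ACGTTGCAAT"  # toy fragment, not real data
print(paired_end(fragment, 3))
print(mate_pair(fragment, 3))
```

A simulator that only supports one orientation can be adapted to the other by reverse-complementing both reads, which is all that distinguishes the two functions here.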

(BTW, I've written a simulator for 454 data (flowsim); feel free to contact me if you're interested in seeing this extended to paired-end, or rather mate-paired, sequences.)

ADD COMMENTlink written 9.1 years ago by Ketil4.0k
4.4 years ago by
sacha1.8k
France
sacha1.8k wrote:

I don't understand how you set the depth with wgsim?

ADD COMMENTlink written 4.4 years ago by sacha1.8k

via read length, number of reads and the length of the input sequence?
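
Spelled out, mean depth is roughly (number of pairs × (read 1 length + read 2 length)) / genome length, so the pair count for a target depth follows by rearranging. The sketch below does that arithmetic; the helper names and the example figures (a 5 Mb genome at 30x) are illustrative assumptions, and the result is the value one would pass to `-N`.

```python
import math

def pairs_for_depth(genome_length, depth, read1_len, read2_len):
    """Number of read pairs needed to reach a target mean depth."""
    bases_needed = depth * genome_length
    return math.ceil(bases_needed / (read1_len + read2_len))

def fasta_length(path):
    """Total number of bases in a FASTA file (header lines are skipped)."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total

# Hypothetical example: 30x over a 5 Mb genome with 75 bp reads on each end.
n_pairs = pairs_for_depth(5_000_000, 30, 75, 75)
print(n_pairs)
```

This ignores Ns in the reference and reads lost to filtering, so the realized depth may come out slightly lower than the target.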

ADD REPLYlink written 4.4 years ago by Aerval280
4.3 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

RandomReads, in the BBMap package, supports paired-ends. For example:

randomreads.sh ref=ref.fa out=reads.fq paired interleaved reads=100k length=150 mininsert=200 maxinsert=400 gaussian
ADD COMMENTlink modified 11 weeks ago by RamRS24k • written 4.3 years ago by Brian Bushnell16k

I have started to have the feeling that everything is implemented in the BBMap package :-) 

ADD REPLYlink written 4.3 years ago by Istvan Albert ♦♦ 81k

That's my ultimate goal... haven't quite reached it yet!

ADD REPLYlink written 4.3 years ago by Brian Bushnell16k

Hi Brian! Is it possible to generate reads in specific intervals? WES-like read simulation?

ADD REPLYlink written 2.4 years ago by user230613280

No, unfortunately not. You'd have to use something like bedtools to pull out the exome fasta using the genome fasta and the bait coordinates, and then use RandomReads on the result. I don't currently have anything to parse bed, but that does seem like a good addition to RandomReads.
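
For illustration, the extraction step can be sketched in plain Python as a stand-in for what `bedtools getfasta` does: slice each BED interval (0-based, half-open) out of the genome and emit it as its own FASTA record. The parsing helpers and the toy genome below are assumptions made for this example, not part of bedtools or BBMap; for real data the bedtools route is the sensible one.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence name: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # keep only the first word of the header
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}

def extract_intervals(seqs, bed_lines):
    """Yield (name, subsequence) for each BED interval (0-based, half-open)."""
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        yield f"{chrom}:{start}-{end}", seqs[chrom][start:end]

# Toy genome and bait coordinates, for illustration only.
genome = read_fasta([">chr1 toy", "ACGTACGTAC", "GGGGCCCC"])
baits = ["chr1\t2\t6", "chr1\t10\t14"]
for name, seq in extract_intervals(genome, baits):
    print(f">{name}\n{seq}")
```

Because each bait becomes a separate short reference sequence, simulated fragments cannot extend past bait boundaries, which differs from real exome capture where reads spill into flanking sequence.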

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

Thank you for the fast answer. I'll try the bedtools pre-step. Another issue: I've realised that in PE mode the names of the output reads in the two files are not paired. Is there an option for this?

ADD REPLYlink written 2.4 years ago by user230613280

Yes - add the flag "illuminanames".
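
A quick way to confirm that two FASTQ files are name-paired is to compare the read names record by record. The sketch below assumes four-line FASTQ records and names that differ only by a trailing `/1` or `/2` (or by a comment after the first space); the exact name format RandomReads emits is not stated in this thread, so treat that as an assumption.

```python
import io
import re

def fastq_names(handle):
    """Yield the read name of every FASTQ record (assumes 4 lines per record)."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            yield line[1:].split()[0]    # drop '@' and any trailing comment

def strip_mate_suffix(name):
    """Remove a trailing /1 or /2 mate marker, if present."""
    return re.sub(r"/[12]$", "", name)

def names_are_paired(handle1, handle2):
    """True if both files list the same read names in the same order."""
    names1 = [strip_mate_suffix(n) for n in fastq_names(handle1)]
    names2 = [strip_mate_suffix(n) for n in fastq_names(handle2)]
    return names1 == names2

# In-memory toy records standing in for the two output files.
r1 = io.StringIO("@read0/1\nACGT\n+\nIIII\n@read1/1\nTTTT\n+\nIIII\n")
r2 = io.StringIO("@read0/2\nACGT\n+\nIIII\n@read1/2\nAAAA\n+\nIIII\n")
print(names_are_paired(r1, r2))
```

For real files, pass handles from `open("reads_1.fq")` and `open("reads_2.fq")` (hypothetical names) instead of the `StringIO` objects.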

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

Is it possible to generate RNA-seq reads with BBMap?

ADD REPLYlink written 16 months ago by k.kathirvel93210