NGS DNA Read Simulator With Quality Scores Available?
4
3
11.8 years ago
Travis ★ 2.8k

Hi all,

I am looking to generate some simulated Illumina 100bp paired end DNA reads.

I have tried a couple of options so far, including SAMtools wgsim and BFAST's bgeneratereads; however, neither of them simulates quality scores. Every base is assigned either the same placeholder symbol or the same number.

Is anyone aware of software that includes quality scores in its simulation?

simulation dna next-gen sequencing samtools quality • 6.7k views
4
11.8 years ago

You should definitely try simNGS: http://www.ebi.ac.uk/goldman-srv/simNGS/

It also simulates the Illumina library preparations. For the simulation, it uses real intensity files from an existing Illumina machine. The latest version has an example 101bp run from an Illumina HiSeq machine with TruSeq chemistry at Sanger.

0

The simNGS package can simulate paired-end library construction (with adjustable mean and std dev) and sample preparation errors (substitutions and indels) as well. The sample prep error rate is properly incorporated in the simulated quality scores.
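For readers wondering what folding a sample prep error rate into a quality score could look like, here is a minimal sketch. It is not taken from simNGS and I don't know that simNGS does it this way; it only assumes the standard Phred definition Q = -10 * log10(p) and that sequencing and prep errors are independent. The function names and the `max_q` cap are made up for illustration.

```python
import math


def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_to_phred(p, max_q=60):
    """Convert an error probability to a rounded Phred score (capped at max_q, an arbitrary choice)."""
    if p <= 0:
        return max_q
    return min(max_q, round(-10 * math.log10(p)))


def combined_quality(seq_q, prep_error_rate):
    """Quality after combining sequencing error with an independent prep error.

    Assumption: the base is wrong if either error occurs, so
    p_total = 1 - (1 - p_seq) * (1 - p_prep).
    """
    p_seq = phred_to_prob(seq_q)
    p_total = 1 - (1 - p_seq) * (1 - prep_error_rate)
    return prob_to_phred(p_total)
```

Under these assumptions a base called at Q40 with a 0.1% prep error rate drops to roughly Q30, since the prep error dominates the total error probability.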

2
11.8 years ago
Nilshomer ▴ 100

You can also try the dwgsim program in the DNAA package (http://dnaa.sf.net). This also has two programs to assess the sensitivity/specificity of your mapping (dwgsim_eval) and pileup (dwgsim_pileup_eval).

0

Does this do quality scores, Nils?

2
11.8 years ago
Mitch Bekritsky ★ 1.3k

I've used MAQ simulate to get reads with simulated quality scores before. Instead of using real intensity files as simNGS does, it generates a transition matrix from FASTQ file(s) and uses that to simulate read quality. I like this option because it lets me train on reads previously obtained from the same machine, which I feel gives me quality scores that are a good representation of what I can expect in the future from that machine or sequencing core (the sequencing facility I get my data from doesn't keep raw intensity files for more than 2 weeks).
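To give a feel for the transition-matrix idea, here is a rough sketch of a first-order, position-specific Markov model trained on FASTQ quality strings. This is not MAQ's actual code or file format; the function names, the Phred+33 offset, and the fallback for unseen states are all my own assumptions.

```python
import random
from collections import defaultdict

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding; older Illumina files use 64


def train_quality_model(quality_strings):
    """Count first-base qualities and per-position quality transitions."""
    first = defaultdict(int)
    # (position, previous quality) -> next quality -> count
    trans = defaultdict(lambda: defaultdict(int))
    for qs in quality_strings:
        quals = [ord(c) - PHRED_OFFSET for c in qs]
        if not quals:
            continue
        first[quals[0]] += 1
        for pos in range(1, len(quals)):
            trans[(pos, quals[pos - 1])][quals[pos]] += 1
    return first, trans


def _draw(counts, rng):
    """Sample one key from a {value: count} table, weighted by count."""
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def simulate_quality(first, trans, length, rng=random):
    """Generate one quality string of the given length from the trained counts."""
    quals = [_draw(first, rng)]
    for pos in range(1, length):
        counts = trans.get((pos, quals[-1]))
        if counts:
            quals.append(_draw(counts, rng))
        else:
            # Unseen (position, quality) state: repeat the previous quality.
            quals.append(quals[-1])
    return "".join(chr(q + PHRED_OFFSET) for q in quals)
```

You would feed `train_quality_model` the fourth line of every FASTQ record from a real run, then call `simulate_quality(first, trans, 100)` once per simulated read. Because the transitions are position-specific, the usual drop in quality toward the 3' end is picked up from the training data rather than hard-coded.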

It also creates paired-end reads with an insert size mean and std dev you can tweak, and has options for setting the substitution and indel frequencies.
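The paired-end part is simple enough to sketch. The following is only an illustration of drawing a fragment whose length follows a normal distribution and reading both ends; it is not how MAQ implements it. I'm assuming "insert size" means the full fragment length, and the function names and default values are invented for the example.

```python
import random

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_pair(reference, read_len=100, insert_mean=300.0, insert_sd=30.0, rng=random):
    """Return (start, read1, read2) for one fragment drawn from the reference.

    Assumption: insert size = outer fragment length, drawn from a normal
    distribution and clamped to [read_len, len(reference)].
    """
    if len(reference) < read_len:
        raise ValueError("reference is shorter than the read length")
    frag_len = int(round(rng.gauss(insert_mean, insert_sd)))
    frag_len = max(read_len, min(frag_len, len(reference)))
    start = rng.randint(0, len(reference) - frag_len)
    fragment = reference[start:start + frag_len]
    read1 = fragment[:read_len]                       # forward strand, 5' end of the fragment
    read2 = reverse_complement(fragment[-read_len:])  # reverse strand, from the other end
    return start, read1, read2
```

Substitutions and indels would then be applied to `read1` and `read2` at whatever rates you choose, with quality strings attached afterwards.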

0
11.8 years ago
Benm ▴ 710

I wrote a program a while ago for simulating NGS reads. It supports Solexa/Illumina FASTQ format, SOLiD/ABI color space format, and 454/Roche fna/qual format; it handles paired ends, mate pairs, and reads containing adapter/primer/cloning vector sequence or enzyme digestion sites; and it can generate variation including SNPs, indels, and SVs. It has not been released yet, though.

As for your question, I don't think you need to focus on simulating quality scores for your simulated Illumina reads; random qualities are OK. In practice, though, the rightmost 5~20 bp at the 3' end of the reads tend to have lower quality, so you can follow this subroutine (Perl), which appends the quality string through a reference:

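#Usage: generate_qual(\$quality, $Reads_length);
sub generate_qual
{
    # $quality is a reference to the string that the qualities are appended to
    my ($quality, $Reads_length) = @_;
    # All but the last 5 bases: random characters with ASCII codes 74..109
    for (my $i = 0; $i < $Reads_length - 5; $i++)
    {
        $$quality .= chr(int(rand(36)) + 74);
    }
    # Last 5 bases (3' end): wider range reaching lower values, ASCII codes 64..109
    for (my $i = $Reads_length - 5; $i < $Reads_length; $i++)
    {
        $$quality .= chr(int(rand(46)) + 64);
    }
}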


0

I uploaded my program to SourceForge: https://sourceforge.net/projects/simulateseq/files/0.2.2/