which tools you use to generate artificial NGS?
4
2
Entering edit mode
9.2 years ago
Medhat 9.7k

There is a lot of tools to generate artificial NGS

Which one you use or suggest to use? Is there is any to simulate PacBio reads?

sequence software next-gen • 3.3k views
ADD COMMENT
2
Entering edit mode
9.2 years ago
thackl ★ 3.0k

I like ART for NGS. It's comprehensive and one can easily generate custom profiles from a given fastq samples, e-g- MiSeq 300bp or brand new HiSeq 250bp

Default pbsim reads come with a little more errors and shorter reads than what in my experience you get with real-life pacbio data. For convenience I use the script below to directly generate PacBio reads from a reference FASTA with length and error distributions closer to what you get from current P4-C2/P5-C3 runs. (Just put the code in a file, e.g. pbsim-clr, next to the pbsim binary and make it executable)

#!/bin/bash
BIN="$(dirname "$(readlink -f "$0")")";

COV=10;
[[ $1 == '-c' ]] && { COV=$2; shift; shift; };
[[ $# -eq 0 || $1 == "-h"* ]] && { echo -e "Usage:\n pbsim-clr [-c cov] FASTA [FASTA ..]"; exit 0; }

for FILE in $@; do
    PRE=`basename ${FILE%.*}-pb`;
    $BIN/pbsim \
        --data-type CLR \
        --depth $COV \
        --length-max 50000 \
        --length-min 1000 \
        --length-mean 7000 \
        --difference-ratio 8:62:30 \
        --accuracy-mean 0.83 \
        --model_qc $BIN/data/model_qc_clr \
        $FILE

    rm sd_0001.ref
    I=1;
    OUT=$PRE-`printf %02d $I`
    while [[ -e $OUT.fq ]]; do
        I=$(( $I + 1 ));
        OUT=$PRE-`printf %02d $I`;
    done;
    mv sd_0001.fastq $OUT.fq
    mv sd_0001.maf $OUT.maf
done;

The DAZZLER simulater only produces random sequences with PacBio length profile.

[edit] Edited to prevent any confusion about dazzler - see comments.

ADD COMMENT
0
Entering edit mode

Is "The reads must not exceed a maximum length of 36 bp?" about DAZZLER I think it uses static parameter to generate reads

It collects enough reads to cover the genome -c times and Introduces -e fraction errors into each read where the ratio of insertions, deletions, and substitutions are set by defined constants INS_RATE (default 73%) and DEL_RATE (default 20%) within generate.c. One can also control the rate at which reads are picked from the forward and reverse strands by setting the defined constant FLIP_RATE (default 50/50).

ADD REPLY
0
Entering edit mode

There is definitely no length limit on read length in the latest version ART-VanillaIceCream-03-11-2014. However, I think default Illumina profiles are only available for 36-150 bp. For longer profiles, you will need to have/download an appropriate read set and profile the FASTQ file first.

ADD REPLY
0
Entering edit mode

About DAZZLER, you are right, it's a rather simple simulator, but it found it to be sufficient for my purposes (testing pacbio alignments / pacbio correction)

ADD REPLY
1
Entering edit mode

About DAZZLER you do not use the reference genome am I right? you just use the size of it

ADD REPLY
0
Entering edit mode

Sorry for the confusion. Yes, dazzler is just random sequence. I edited my original answer. Thanks for making me aware

ADD REPLY
1
Entering edit mode
9.2 years ago
try pbsim
ADD COMMENT
1
Entering edit mode
9.2 years ago
5heikki 11k

How about wgsim from Heng Li?

ADD COMMENT
0
Entering edit mode

dwgsim is based off of wgsim found in SAMtools written by Heng Li, and forked from DNAA

ADD REPLY
0
Entering edit mode

Now I am confused as to the point of dwgsim, how is it different? I don't seem to be able to find any extensive readme for either dnaa or dwgsim

ADD REPLY
0
Entering edit mode

here is how it was written in the wiki:

Whole genome simulation can be performed with dwgsim. dwgsim is based off of wgsim found in SAMtools written by Heng Li, and forked from DNAA. It was modified to handle ABI SOLiD and Ion Torrent data, as well as various assumptions about aligners and positions of indels. Many new features have been subsequently added.

Link

dwgsin info

ADD REPLY
1
Entering edit mode
9.2 years ago

RandomReads in the BBMap package has a fairly good PacBio mode. For example:

randomreads.sh ref=reference.fa out=pb.fa reads=10000 minlength=200 maxlength=15000 pacbio=true pbmin=0.13 pbmax=0.17

...will generate 10k reads of length 200 to 15000, with error rates between 0.13 and 0.17, split between insertions, deletions, and substitutions in a ratio and with lengths in accordance with our PacBio lab data.

ADD COMMENT
0
Entering edit mode

do you have any idea about the profile it uses to generate this reads?

ADD REPLY
1
Entering edit mode

Since I wrote it, yes :) The error rate is user-defined. The error type is 40% insertion, 35% deletion, and 25% substitution. The quality scores correctly reflect the chance that a given base is in error.

ADD REPLY
0
Entering edit mode

How can I control the coverage? or reads=10000 is that means that I will have 10k reads covering all the genome so I need to multiply it by the mean read "200+15000/2" then divided by the genome size and have the coverage? is it CLR or CCS?

Meanwhile I tried to use the program with chromosome 3 in rice like this:

bbmap6/randomreads.sh -Xmx10g \
    ref=/data/genomes_reference/rice/Oryza_sativa.IRGSP-1.0.27.dna.chromosome.3.fa \
    out=/data/genomes_artificial/pacbio/rice/rice_pacbio.fa \
    reads=95826 minlength=200 maxlength=15000 \
    pacbio=true pbmin=0.13 pbmax=0.17

in the info file:

#chrom  scaffolds       contigs length  defined undefined       startPad        stopPad
1       1       12      36429819        36404619        25200   8000    8000

Why I have 12 scaffolds? also in the rice_pacbio.fa file I have the reads by number not by their genomic origin?!!! and it is from 0 to 95824 "which is equal to the number of reads reads=95826"

ADD REPLY
0
Entering edit mode

reads=X will output X reads (unless you are outputting paired reads, which you aren't).

So yes, the coverage is (reads*(minlength+maxlength)/2)/(genome size)

These reads will be similar to filtered subreads. If you want something closer to ccs, then decrease pbmin and pbmx, which control the bounds of the error rates for individual reads. 3-pass ccs reads might be closer to pbmin=0.4 pbmax=0.7.

As for the info file, you are looking at the wrong column - there is 1 scaffold which has 12 contigs. The number of contigs is not really important.

The read names are a little bit unclear, but... here's a sample read name:

0_chr1_0_407422_407571_399422_ecoli

To break that down, it is:

A_B_C_D_E_F_G

where:

  • A = 0: this is the read number
  • B = chr1: This is for the internal coordinate system and you can ignore it
  • C = 0: This is the strand. 0 means it came from the plus strand, 1 means the minus strand.
  • D = 407422: This is the internal start coordinate
  • E = 407571: This is the internal stop coordinate
  • F = 399422: This is scaffold-relative start coordinate, zero-based (corresponding to pos=399423 in sam format)
  • G = ecoli: This is the scaffold name.

This is confusing so I will add a new naming mode that is just ID_strand_start_stop_name, e.g. 0_+_399422_399571_ecoli, which will be easier to use.

ADD REPLY
0
Entering edit mode

My mistake, the genomic origin names don't work with fasta output, only fastq. I've fixed that in the dev branch and it will be in the next update.

ADD REPLY
0
Entering edit mode

OK some how it is clear now , still I do not understand why 12 contigs Or what is the purpose of it. about the update you have an idea when I can download it?

ADD REPLY

Login before adding your answer.

Traffic: 2668 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6