Question: Best way to create synthetic metagenomic reads
yoann.dufresne0 (France/Paris/Institut Pasteur) wrote, 10 months ago:

Hello everyone,

I'm trying to test the performance of different read correctors on metagenomic data. My analysis steps are as follows:

1 - Create a metagenomic dataset without sequencing errors using my own genome set (I want to control the divergence between species in the sample).
2 - Insert errors into the reads.
3 - Run multiple read correctors on this dataset.
4 - Compare their results with the reads from step 1.

Everything works except step 1. I tried CAMISIM, but its sparse documentation is a serious problem (I have been talking a lot with its creator to understand the details). Then I tried InSilicoSeq, which is easy to run, but for now it cannot produce perfect (error-free) reads at the end of step 1.

So, do you have suggestions for this step?

Carambakaracho (Germany/Cologne) wrote, 10 months ago:

Could wgsim (distributed with samtools) help you?

wgsim -h

Program: wgsim (short read simulator)
Version: 1.8 
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.15]
         -X FLOAT      probability an indel is extended [0.30]
         -S INT        seed for random generator [0, use the current time]
         -A FLOAT      discard if the fraction of ambiguous bases higher than FLOAT [0.05]
         -h            haplotype mode
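For the error-free reads needed in step 1, the options listed in the help output above suggest setting the base error rate (-e), the mutation rate (-r) and the indel fraction (-R) to zero. Here is a minimal Python sketch that only builds such a command line. The helper name `wgsim_command` and the file names are hypothetical, and the flags are limited to those shown in the help text:

```python
import subprocess

def wgsim_command(ref_fasta, out_r1, out_r2, n_pairs, read_len=100, seed=42):
    """Build a wgsim argument list intended to yield reads without errors or mutations.

    Hypothetical helper: it only uses options listed in `wgsim -h`.
    """
    return [
        "wgsim",
        "-e", "0",              # base error rate
        "-r", "0",              # rate of mutations
        "-R", "0",              # fraction of indels
        "-N", str(n_pairs),     # number of read pairs
        "-1", str(read_len),    # length of the first read
        "-2", str(read_len),    # length of the second read
        "-S", str(seed),        # fixed seed so the run can be repeated
        ref_fasta, out_r1, out_r2,
    ]

if __name__ == "__main__":
    cmd = wgsim_command("genomeA.fa", "genomeA_R1.fq", "genomeA_R2.fq", n_pairs=50000)
    print(" ".join(cmd))
    # Uncomment to run; requires wgsim on the PATH and an existing reference file.
    # subprocess.run(cmd, check=True)
```

It is worth checking on a small genome that the resulting reads really match the reference exactly before relying on them as ground truth.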

yoann.dufresne0 replied, 10 months ago:

No, because it does not generate metagenomic samples. It only simulates the sequencing of one genome.

I'm currently writing a script to pull some species using a lognorm distribution. Maybe I will use wgsim to generate reads from the pulled genomes.
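One possible reading of "pull some species using a lognorm distribution" is to draw a relative abundance for each genome from a log-normal distribution and turn it into a number of read pairs per genome. This is a rough sketch under that assumption. The function names, the `mu` and `sigma` defaults and the genome names are all hypothetical, and abundance is treated here as a fraction of reads rather than of cells:

```python
import random

def lognormal_abundances(genome_names, mu=1.0, sigma=2.0, seed=0):
    """Draw one log-normal weight per genome and normalise the weights to sum to 1."""
    rng = random.Random(seed)
    weights = {name: rng.lognormvariate(mu, sigma) for name in genome_names}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

def read_pairs_per_genome(abundances, total_pairs):
    """Split a total number of read pairs according to the relative abundances.

    Rounding means the counts may differ from total_pairs by a few pairs.
    """
    return {name: round(frac * total_pairs) for name, frac in abundances.items()}

if __name__ == "__main__":
    genomes = ["genomeA", "genomeB", "genomeC", "genomeD"]
    abundances = lognormal_abundances(genomes, seed=1)
    for name, n_pairs in read_pairs_per_genome(abundances, 1_000_000).items():
        print(f"{name}\t{abundances[name]:.4f}\t{n_pairs}")
```

The per-genome counts could then be passed to a single-genome simulator such as wgsim via its -N option.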

Carambakaracho replied, 10 months ago (modified 10 months ago):

Sure, implicitly I thought this was what you needed.

"No, because it does not generate metagenomic samples. It only simulates the sequencing of one genome."

Well, I cannot accept that as an answer ;-) I produced reads at different coverages for a dozen or so genomes and shuffled them together. That gave me a very well-defined metagenome.

"pull some species using a lognorm distribution"

Why do you need to apply a lognorm distribution for pulling (downloading?) genomes?
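The approach of simulating each genome at its own coverage and then shuffling the reads together could look roughly like the Python sketch below. It assumes paired-end, 4-line FASTQ files (one pair of files per genome) that fit in memory. All function names are hypothetical.

```python
import random

def pairs_for_coverage(genome_length, coverage, read_len):
    """Read pairs needed so that 2 * read_len * pairs is about coverage * genome_length."""
    return max(1, round(coverage * genome_length / (2 * read_len)))

def read_fastq(path):
    """Return the file content as a list of 4-line FASTQ records."""
    with open(path) as handle:
        lines = handle.read().splitlines()
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

def shuffle_pairs(r1_paths, r2_paths, out_r1, out_r2, seed=0):
    """Pool the read pairs of all genomes and write them out in random order.

    Mates are shuffled together, so record i of out_r1 still pairs with
    record i of out_r2. Returns the number of pairs written.
    """
    pairs = []
    for path1, path2 in zip(r1_paths, r2_paths):
        pairs.extend(zip(read_fastq(path1), read_fastq(path2)))
    random.Random(seed).shuffle(pairs)
    with open(out_r1, "w") as handle1, open(out_r2, "w") as handle2:
        for rec1, rec2 in pairs:
            handle1.write("\n".join(rec1) + "\n")
            handle2.write("\n".join(rec2) + "\n")
    return len(pairs)

if __name__ == "__main__":
    # Example: pairs needed for 10x coverage of a 5 Mb genome with 2 x 100 bp reads.
    print(pairs_for_coverage(5_000_000, 10, 100))
```

Keeping the genome name in each read header (or in a separate table) makes it possible to trace every read back to its source genome when the correctors are evaluated later.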
genomax (United States) wrote, 10 months ago:

randomreads.sh from the BBMap suite has a metagenome mode. Check it out.
