Question

wgsim - number of reads in ouput does not match set parameter

2

Entering edit mode

11.2 years ago

Brett Kennedy ▴ 50

I'm attempting to simulate reads from the human transcriptome to test an RNAseq pipeline using Heng Li's wgsim. This seems to be the most widely used read simulator and is pretty easy to use, however I've run into a problem which may be related to some of the reference sequences (human transcripts in this case) being quite short.

The problem is that wgsim does not produce the number of reads specified by the -N flag. I'm using the following command line to produce single-end error and mutation free reads

wgsim -e0 -N 40000 -1 250 -d1 -s1 -r 0 -R 0 -X 0 in.fa out.fq /dev/null

However in wgsim does not produce the specified 40,000 reads. Sometimes it outputs a number close to the specified total, while others it outputs much fewer.

To work around this I've been overproducing reads then subsampling using seqtk, but I was wondering if anyone had seen this behavior before and had any kind of solution?

Thanks.

wgsim read-simulation • 3.9k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 11.2 years ago by Brett Kennedy ▴ 50

Ram · Answer 1 · 2014-05-08

2

Entering edit mode

11.2 years ago

Istvan Albert 102k

wgsim will attempt to generate reads covering all chromosomes in the genome, For example if you had 10 chromosomes and asked for 10 reads you'd get a randomly positioned read from each chromosome.

If it worked by true random chance you could end up with all reads coming from a single chromosome. In the new scenario the number of reads originating from a chromosome is always close to the expected proportion of the chromosome relative to the genome (also rounded to the closest nonzero integer). Due to this internal re-normalization it may produce slightly fewer or slightly more reads.

This is actually a really cool feature of wgsim!

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 11.2 years ago by Istvan Albert 102k

0

Entering edit mode

I see, that makes complete sense. However, I have one set of transcripts that consistently produces less than half the number of reads that I specify. However, other references made similarly, only output a few less than specified. Any idea what could be causing that type of behavior?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 11.2 years ago by Brett Kennedy ▴ 50

0

Entering edit mode

One situation that could lead to strange results would occur when the number of transcripts is of the same order of magnitude as N.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 11.2 years ago by Istvan Albert 102k