Question: wgsim - number of reads in ouput does not match set parameter
2
gravatar for Brett Kennedy
5.3 years ago by
Salt Lake City, UT
Brett Kennedy50 wrote:

I'm attempting to simulate reads from the human transcriptome to test an RNAseq pipeline using Heng Li's wgsim (https://github.com/lh3/wgsim).  This seems to be the most widely used read simulator and is pretty easy to use, however I've run into a  problem which may be related to some of the reference sequences (human transcripts in this case) being quite short.

The problem is that wgsim does not produce the number of reads specified by the -N flag.  I'm using the following command line to produce single-end error and mutation free reads

wgsim -e0 -N 40000 -1 250 -d1 -s1 -r 0 -R 0 -X 0 in.fa out.fq /dev/null

However in wgsim does not produce the specified 40,000 reads.  Sometimes it outputs a number close to the specified total, while others it outputs much fewer.

To work around this I've been overproducing reads then subsampling using seqtk, but I was wondering if anyone had seen this behavior before and had any kind of solution?

 

Thanks. 

ADD COMMENTlink modified 5.3 years ago by Istvan Albert ♦♦ 81k • written 5.3 years ago by Brett Kennedy50
2
gravatar for Istvan Albert
5.3 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

wgsim will attempt to generate reads covering all chromosomes in the genome, For example if you had 10 chromosomes and asked for 10 reads you'd get a randomly positioned read from each chromosome.

If it worked by true random chance you could end up with all reads coming from a single chromosome. In the new scenario the number of reads originating from a chromosome is always close to the expected proportion of the chromosome relative to the genome (also rounded to the closest nonzero integer). Due to this internal re-normalization it may produce slightly fewer or slightly more reads. 

This is actually a really cool feature of wgsim! 

 

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Istvan Albert ♦♦ 81k

I see, that makes complete sense.  However, I have one set of transcripts that consistently produces less than half the number of reads that I specify.  However, other references made similarly, only output a few less than specified.  Any idea what could be causing that type of behavior?

ADD REPLYlink written 5.3 years ago by Brett Kennedy50

One situation that could lead to strange results would occur when the number of transcripts is of the same order of magnitude as N. 

ADD REPLYlink written 5.3 years ago by Istvan Albert ♦♦ 81k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 804 users visited in the last hour