wgsim - number of reads in ouput does not match set parameter
1
2
Entering edit mode
10.0 years ago

I'm attempting to simulate reads from the human transcriptome to test an RNAseq pipeline using Heng Li's wgsim. This seems to be the most widely used read simulator and is pretty easy to use, however I've run into a problem which may be related to some of the reference sequences (human transcripts in this case) being quite short.

The problem is that wgsim does not produce the number of reads specified by the -N flag. I'm using the following command line to produce single-end error and mutation free reads

wgsim -e0 -N 40000 -1 250 -d1 -s1 -r 0 -R 0 -X 0 in.fa out.fq /dev/null

However in wgsim does not produce the specified 40,000 reads. Sometimes it outputs a number close to the specified total, while others it outputs much fewer.

To work around this I've been overproducing reads then subsampling using seqtk, but I was wondering if anyone had seen this behavior before and had any kind of solution?

Thanks.

wgsim read-simulation • 3.4k views
ADD COMMENT
2
Entering edit mode
10.0 years ago

wgsim will attempt to generate reads covering all chromosomes in the genome, For example if you had 10 chromosomes and asked for 10 reads you'd get a randomly positioned read from each chromosome.

If it worked by true random chance you could end up with all reads coming from a single chromosome. In the new scenario the number of reads originating from a chromosome is always close to the expected proportion of the chromosome relative to the genome (also rounded to the closest nonzero integer). Due to this internal re-normalization it may produce slightly fewer or slightly more reads.

This is actually a really cool feature of wgsim!

ADD COMMENT
0
Entering edit mode

I see, that makes complete sense. However, I have one set of transcripts that consistently produces less than half the number of reads that I specify. However, other references made similarly, only output a few less than specified. Any idea what could be causing that type of behavior?

ADD REPLY
0
Entering edit mode

One situation that could lead to strange results would occur when the number of transcripts is of the same order of magnitude as N.

ADD REPLY

Login before adding your answer.

Traffic: 1457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6