I have a FASTQ file with 2,198,402 reads and am trying to use FastqSampler to select 2 million of them at random. But strangely, I don't get 2 million back; I get fewer (1,929,951 in the example below). Why?
Might it have something to do with the way FastqSampler chunks the input file? (The author describes the chunking here: A: Selecting random pairs from fastq?)
It works fine if I set n to 1 million reads.
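My understanding is that a chunked sampler should still return exactly n records whenever the stream contains at least n, as in reservoir sampling. A minimal sketch of that idea (my own toy code, not the actual FastqSampler implementation) behaves as I would expect:

```r
## Reservoir sampling (Algorithm R) sketch -- a toy stand-in, NOT how
## FastqSampler is necessarily implemented internally.
reservoir_sample <- function(stream, n) {
  res <- stream[seq_len(min(n, length(stream)))]  # fill reservoir with first n
  seen <- length(res)
  for (x in stream[-seq_len(seen)]) {             # process the rest one by one
    seen <- seen + 1
    j <- sample.int(seen, 1)                      # pick a slot in 1..seen
    if (j <= n) res[j] <- x                       # replace with prob n/seen
  }
  res
}
length(reservoir_sample(1:100, 10))  # always exactly 10
```

So if FastqSampler works along these lines, I'd expect yield() to return exactly n = 2e6 reads, not fewer.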
> library(ShortRead)
> fq=readFastq("file.fq")
> fq
class: ShortReadQ
length: 2198402 reads; width: 100 cycles
> 
> fqs=FastqSampler("file.fq", n=2e6)
> yield(fqs)
class: ShortReadQ
length: 1929951 reads; width: 100 cycles
> 
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] ShortRead_1.14.4     Rsamtools_1.8.6      lattice_0.20-13      Biostrings_2.24.1    GenomicRanges_1.8.13 IRanges_1.14.4      
loaded via a namespace (and not attached):
[1] Biobase_2.16.0 grid_2.15.2    hwriter_1.3    tools_2.15.2
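For now I'm working around it by sampling indices directly, since the file evidently fits in memory (readFastq() above succeeds). This is just a sketch of my workaround, assuming ShortReadQ subsetting with [ as shown:

```r
## Workaround sketch: load everything, then subset without replacement.
library(ShortRead)
fq <- readFastq("file.fq")
set.seed(123)                          # for reproducibility
fqs2 <- fq[sample(length(fq), 2e6)]    # draw exactly 2e6 reads
length(fqs2)
```

This gives exactly 2,000,000 reads, but obviously loses the streaming/low-memory advantage that FastqSampler is meant to provide.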