Question

Shortread Fastqsampler Yields Fewer Than Expected Sequences?

1

Entering edit mode

12.8 years ago

Owen S. ▴ 370

I have a fastq file with over 2 million reads. I am trying to use FastqSampler to select 2 million reads at random. But strangely, I don't get 2 million, I get something less -- 1929951 in the following example. Why?

Might it have something to do with the way FastqSampler chunks the input file? (author is describing this here: A: Selecting random pairs from fastq? )

It works fine if I set n to 1 million reads.

> library(ShortRead)
> fq=readFastq("file.fq")
> fq
class: ShortReadQ
length: 2198402 reads; width: 100 cycles
> 
> fqs=FastqSampler("file.fq", n=2e6)
> yield(fqs)
class: ShortReadQ
length: 1929951 reads; width: 100 cycles
> 
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.14.4     Rsamtools_1.8.6      lattice_0.20-13      Biostrings_2.24.1    GenomicRanges_1.8.13 IRanges_1.14.4      

loaded via a namespace (and not attached):
[1] Biobase_2.16.0 grid_2.15.2    hwriter_1.3    tools_2.15.2

bioconductor • 3.5k views

ADD COMMENT • link updated 11.6 years ago by Biostar 20 • written 12.8 years ago by Owen S. ▴ 370

score 1 · Accepted Answer · 2013-03-01

Sorry for the delayed reply. I'm not able to reproduce this with the current version of ShortRead (1.16.3); there were bug fixes related to FastqSampler between the version you are using and the current version.

> set.seed(123)
> fl <- "~/b/working/tmp.fastq"
> readFastq(fl)
class: ShortReadQ
length: 2198402 reads; width: 37 cycles
> fqs=FastqSampler(fl, n=2e6)
> yield(fqs)
class: ShortReadQ
length: 2000000 reads; width: 37 cycles
> sessionInfo()
R version 2.15.2 Patched (2012-12-23 r61401)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.16.4     latticeExtra_0.6-24  RColorBrewer_1.0-5  
 [4] Rsamtools_1.10.2     lattice_0.20-13      Biostrings_2.26.3   
 [7] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0  
[10] BiocInstaller_1.8.3 

loaded via a namespace (and not attached):
[1] Biobase_2.18.0  bitops_1.0-5    grid_2.15.2     hwriter_1.3    
[5] parallel_2.15.2 stats4_2.15.2   zlibbioc_1.4.0

Try

source("http://bioconductor.org/biocLite.R")
biocLite(character())

to update packages. Please ask follow-up questions on the Bioconductor mailing list (no subscription required).