Shortread Fastqsampler Yields Fewer Than Expected Sequences?
1
1
Entering edit mode
11.2 years ago
Owen S. ▴ 370

I have a fastq file with over 2 million reads. I am trying to use FastqSampler to select 2 million reads at random. But strangely, I don't get 2 million, I get something less -- 1929951 in the following example. Why?

Might it have something to do with the way FastqSampler chunks the input file? (author is describing this here: A: Selecting random pairs from fastq? )

It works fine if I set n to 1 million reads.

> library(ShortRead)
> fq=readFastq("file.fq")
> fq
class: ShortReadQ
length: 2198402 reads; width: 100 cycles
> 
> fqs=FastqSampler("file.fq", n=2e6)
> yield(fqs)
class: ShortReadQ
length: 1929951 reads; width: 100 cycles
> 
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.14.4     Rsamtools_1.8.6      lattice_0.20-13      Biostrings_2.24.1    GenomicRanges_1.8.13 IRanges_1.14.4      

loaded via a namespace (and not attached):
[1] Biobase_2.16.0 grid_2.15.2    hwriter_1.3    tools_2.15.2
bioconductor • 3.2k views
ADD COMMENT
1
Entering edit mode
11.1 years ago
Martin Morgan ★ 1.6k

Sorry for the delayed reply. I'm not able to reproduce this with the current version of ShortRead (1.16.3); there were bug fixes related to FastqSampler between the version you are using and the current version.

> set.seed(123)
> fl <- "~/b/working/tmp.fastq"
> readFastq(fl)
class: ShortReadQ
length: 2198402 reads; width: 37 cycles
> fqs=FastqSampler(fl, n=2e6)
> yield(fqs)
class: ShortReadQ
length: 2000000 reads; width: 37 cycles
> sessionInfo()
R version 2.15.2 Patched (2012-12-23 r61401)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.16.4     latticeExtra_0.6-24  RColorBrewer_1.0-5  
 [4] Rsamtools_1.10.2     lattice_0.20-13      Biostrings_2.26.3   
 [7] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0  
[10] BiocInstaller_1.8.3 

loaded via a namespace (and not attached):
[1] Biobase_2.18.0  bitops_1.0-5    grid_2.15.2     hwriter_1.3    
[5] parallel_2.15.2 stats4_2.15.2   zlibbioc_1.4.0

Try

source("http://bioconductor.org/biocLite.R")
biocLite(character())

to update packages. Please ask follow-up questions on the Bioconductor mailing list (no subscription required).

ADD COMMENT

Login before adding your answer.

Traffic: 2721 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6