I'm trying to analyze some NGS files using the Bioconductor package ShortRead.
Recently I "discovered" that the ShortRead function readFastq() can read both plain .fastq files and compressed .fastq.gz files. As far as I understand, whether the file is compressed should not influence the results.
dat <- ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq")
dat.gz <- ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq.gz")
length(dat)
length(dat.gz)
I receive all NGS results as compressed *.fastq.gz files. For this "test" I therefore used the same file twice: first I analysed it in its compressed state, and then I repeated the analysis after decompressing it with "gunzip".
But for some reason, reading the .fastq.gz file gives me only half as many sequences as reading the uncompressed file. When I analyze the data further, I get essentially the same number of unique sequences, but each unique sequence has exactly half as many reads. To avoid confusion: by "reads" I mean how often the same unique sequence occurs within the .fastq file.
Unfortunately, I cannot share the .fastq files since they are confidential but, assuming this is not only a problem on my side, this phenomenon should be present with every available .fastq file.
After some time I investigated this issue further using an artificially small test file containing only 5 unique sequences. Reading the compressed file with
ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq.gz")
returns all 5 sequences exactly once, but reading the decompressed file with
ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq")
returns all 5 sequences twice (10 sequences in total). I also double-checked the .fastq and the .fastq.gz file, and each should contain only 5 sequences.
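One thing I have not ruled out yet: as far as I understand, the pattern argument is a grep-style regular expression used to select file names, so "tmp.fastq" might also match "tmp.fastq.gz" when both files sit in the same directory. Below is a minimal base-R sketch of that check. It assumes both files are present side by side, and the scratch directory and empty placeholder files are hypothetical and only stand in for my real data.

```r
# Hypothetical scratch directory with two empty placeholder files
# (these stand in for the real tmp.fastq and tmp.fastq.gz).
demo_dir <- file.path(tempdir(), "fastq_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("tmp.fastq", "tmp.fastq.gz"))))

# An unanchored pattern is matched as a regular expression anywhere
# in the file name, so it can select more than one file.
loose <- list.files(demo_dir, pattern = "tmp.fastq")

# Anchoring with ^ and $ (and escaping the dots) restricts the match
# to one exact file name.
plain_only <- list.files(demo_dir, pattern = "^tmp\\.fastq$")
gz_only <- list.files(demo_dir, pattern = "^tmp\\.fastq\\.gz$")

print(loose)
print(plain_only)
print(gz_only)
```

If the unanchored pattern lists two files here, the same anchored patterns could be passed to readFastq() in my directory to see whether the counts then agree.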
I'm using R version 3.5.0 (2018-04-23) -- "Joy in Playing" and the ‘ShortRead’ package version 1.40.0.
Can somebody reproduce this issue with their own .fastq files?