Question: Why is there a difference between compressed and uncompressed files when reading it with the ShortRead::readFastq function?
gravatar for rantanplan82
16 days ago by
rantanplan820 wrote:

I'm trying to analyze some NGS files utilizing the Bioconductor-package ShortRead.

Recently I "discovered" that the ShortRead-function readFastq() is able to read normal .fastq files and also compressed .fastq.gz files. In my understanding, the file format should not influence the results.

dat <- ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq")
dat.gz <- ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq.gz")


I get all the NGS-results as compressed *.fastq.gz files. Therefore, for this "test" I basically used the same file, first I analysed it in its compressed state and than I did the same thing but decompressed it beforehand using "gunzip".

But for some reason, I get only half of the number of sequences if I read the .fastq.gz file compared to the uncompressed file. If I analyze the data further I get basically the same number of unique sequences but with also exactly only half of the reads per unique sequence. Just to avoid confusion, with "reads" I mean how often can I find the same unique sequence within the .fastq file.

Unfortunately, I cannot share the .fastq files since they are confidential but, assuming this is not only a problem on my side, this phenomenon should be present with every available .fastq file.

After some time I investigated this issue further by utilising an artificially small test-file, containing only 5 unique sequences. Reading the compressed file ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq.gz") leads to actually getting all 5 sequences just once but reading the decompressed file ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq") results in getting all 5 sequences twice (so in total 10 sequences). I also double-checked the .fastq and the fastq.gz file and there should only be 5 sequences in total.

I'm using R version 3.5.0 (2018-04-23) -- "Joy in Playing" and the ‘ShortRead’ package version 1.40.0.

Can somebody maybe reproduce this issue with there own .fastq-files?

sequencing bioconductor R • 103 views
ADD COMMENTlink modified 16 days ago by benformatics840 • written 16 days ago by rantanplan820

Is there a chance that dat was for some reason being appended to, instead of overwritten?

ADD REPLYlink written 16 days ago by swbarnes25.6k
gravatar for benformatics
16 days ago by
ETH Zurich
benformatics840 wrote:

It seems to be something unique to you or specifically on your end? Maybe try copying your original file like I did? Or post your question on the Bioconductor support site

on the shell:

cp test.fastq test2.fastq
gzip test2.fastq

in R:

> dat.gz <- ShortRead::readFastq(dirPath = "./", pattern = "test2.fastq.gz")
> dat <- ShortRead::readFastq(dirPath = "./", pattern = "test.fastq")
> length(dat)
[1] 5
> length(dat.gz)
[1] 5


R version 3.5.1 (2018-07-02) -- "Feather Spray"
ADD COMMENTlink written 16 days ago by benformatics840

Further unable to replicate your problem with: ShortRead_1.40.0 and R 3.5.2

ADD REPLYlink written 16 days ago by benformatics840

Thanks for trying it out and it seems you are right, for some mysterious reason its appears to be indeed a problem on my side. I followed your suggestion and I got the same results as you did. But I played around with it further and I could "solve" the initially stated problem, using the exact same file. I copied the .fastq.gz file and "gunzip" it. By doing that I got an uncompressed .fastq file which after reading it with ShortRead::readFastq() shows the same length as the compressed one. But if I "gunzip" it with the "-k" flag, in order to spare me the necessity to copy it first, I again get a .fastq file which after reading it with the ShortRead::readFastq() command shows twice the length. So now I have two basically identical .fastq files which give different results when reading them with the ShortRead::readFastq() function in R. In order to find out if there is some kind of difference, I compared those two .fastq files outside of the R environment and there seems to be no difference at all. I compared them using the standard linux command line tools like "wc -l", "diff", "cmp", all of which show no difference between these two files.

To be honest I am a little bit baffled and would be happy for any suggestions whatsoever.

ADD REPLYlink modified 13 days ago • written 13 days ago by rantanplan820

I'm so sorry that I wasted your time, but I figured it out. My problem was that I had both files, the compressed and the uncompressed one, within the same directory with basically the same file name, namely "test.fastq" and "test.fastq.gz" and since ShortRead::readFastq() does not take actual file names but only file name patterns, it read both files at the same time and combined both file contents within the same R object, therefore I ended up with twice as much reads as I was supposed to.

But again, thank you very much for getting involved!

ADD REPLYlink modified 13 days ago • written 13 days ago by rantanplan820
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1063 users visited in the last hour