Question

Why is there a difference between compressed and uncompressed files when reading it with the ShortRead::readFastq function?

0

Entering edit mode

5.0 years ago

rantanplan82 • 0

I'm trying to analyze some NGS files utilizing the Bioconductor-package ShortRead.

Recently I "discovered" that the ShortRead-function readFastq() is able to read normal .fastq files and also compressed .fastq.gz files. In my understanding, the file format should not influence the results.

dat <- ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq")
dat.gz <- ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq.gz")

length(dat)
length(dat.gz)

I get all the NGS-results as compressed *.fastq.gz files. Therefore, for this "test" I basically used the same file, first I analysed it in its compressed state and than I did the same thing but decompressed it beforehand using "gunzip".

But for some reason, I get only half of the number of sequences if I read the .fastq.gz file compared to the uncompressed file. If I analyze the data further I get basically the same number of unique sequences but with also exactly only half of the reads per unique sequence. Just to avoid confusion, with "reads" I mean how often can I find the same unique sequence within the .fastq file.

Unfortunately, I cannot share the .fastq files since they are confidential but, assuming this is not only a problem on my side, this phenomenon should be present with every available .fastq file.

After some time I investigated this issue further by utilising an artificially small test-file, containing only 5 unique sequences. Reading the compressed file ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq.gz") leads to actually getting all 5 sequences just once but reading the decompressed file ShortRead::readFastq(dirPath = "./", pattern = "tmp.fastq") results in getting all 5 sequences twice (so in total 10 sequences). I also double-checked the .fastq and the fastq.gz file and there should only be 5 sequences in total.

I'm using R version 3.5.0 (2018-04-23) -- "Joy in Playing" and the ‘ShortRead’ package version 1.40.0.

Can somebody maybe reproduce this issue with there own .fastq-files?

R sequencing bioconductor • 1.8k views

ADD COMMENT • link updated 5.0 years ago by benformatics 3.9k • written 5.0 years ago by rantanplan82 • 0

0

Entering edit mode

Is there a chance that dat was for some reason being appended to, instead of overwritten?

ADD REPLY • link 5.0 years ago by swbarnes2 14k

score 1 · Answer 1 · 2019-05-10

1

Entering edit mode

5.0 years ago

benformatics 3.9k

It seems to be something unique to you or specifically on your end? Maybe try copying your original file like I did? Or post your question on the Bioconductor support site

on the shell:

cp test.fastq test2.fastq
gzip test2.fastq

in R:

> dat.gz <- ShortRead::readFastq(dirPath = "./", pattern = "test2.fastq.gz")
> dat <- ShortRead::readFastq(dirPath = "./", pattern = "test.fastq")
> length(dat)
[1] 5
> length(dat.gz)
[1] 5

With:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
ShortRead_1.38.0

ADD COMMENT • link 5.0 years ago by benformatics 3.9k

0

Entering edit mode

Further unable to replicate your problem with: ShortRead_1.40.0 and R 3.5.2

ADD REPLY • link 5.0 years ago by benformatics 3.9k

0

Entering edit mode

Thanks for trying it out and it seems you are right, for some mysterious reason its appears to be indeed a problem on my side. I followed your suggestion and I got the same results as you did. But I played around with it further and I could "solve" the initially stated problem, using the exact same file. I copied the .fastq.gz file and "gunzip" it. By doing that I got an uncompressed .fastq file which after reading it with ShortRead::readFastq() shows the same length as the compressed one. But if I "gunzip" it with the "-k" flag, in order to spare me the necessity to copy it first, I again get a .fastq file which after reading it with the ShortRead::readFastq() command shows twice the length. So now I have two basically identical .fastq files which give different results when reading them with the ShortRead::readFastq() function in R. In order to find out if there is some kind of difference, I compared those two .fastq files outside of the R environment and there seems to be no difference at all. I compared them using the standard linux command line tools like "wc -l", "diff", "cmp", all of which show no difference between these two files.

To be honest I am a little bit baffled and would be happy for any suggestions whatsoever.

ADD REPLY • link 5.0 years ago by rantanplan82 • 0

0

Entering edit mode

I'm so sorry that I wasted your time, but I figured it out. My problem was that I had both files, the compressed and the uncompressed one, within the same directory with basically the same file name, namely "test.fastq" and "test.fastq.gz" and since ShortRead::readFastq() does not take actual file names but only file name patterns, it read both files at the same time and combined both file contents within the same R object, therefore I ended up with twice as much reads as I was supposed to.

But again, thank you very much for getting involved!

ADD REPLY • link 5.0 years ago by rantanplan82 • 0