Why do R1 and R2 compressed files have different size
2
3
3.7 years ago
MAPK ★ 2.0k

I have paired-end transcriptome data in two files, R1.fastq and R2.fastq, each 10.8 GB. I compressed them with gzip R1.fastq and gzip R2.fastq, and the resulting files are 2.2 GB and 2.4 GB. Is it possible for two compressed files to differ in size when the uncompressed files are the same size?

fastq gzip • 4.1k views
4

File sizes should never be used as a quantitative measure of anything. Count the number of reads in both files if you want to be certain.
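For anyone who wants to do that count without decompressing to disk, here is a minimal sketch. It assumes standard 4-line FASTQ records (no wrapped sequence lines); `count_reads` and the demo file names are made up for illustration, and the demo writes two tiny files to a temporary directory so the snippet runs on its own.

```python
import gzip
import os
import tempfile


def count_reads(path):
    """Count FASTQ records in a plain or gzip-compressed file.

    Assumes exactly 4 lines per record (header, sequence, '+', quality).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Tiny demo with made-up records; point the paths at your own files instead.
tmpdir = tempfile.mkdtemp()
r1_path = os.path.join(tmpdir, "R1.fastq.gz")
r2_path = os.path.join(tmpdir, "R2.fastq.gz")
for path in (r1_path, r2_path):
    with gzip.open(path, "wt") as out:
        for i in range(5):
            out.write(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n")

n1, n2 = count_reads(r1_path), count_reads(r2_path)
print(n1, n2, "OK" if n1 == n2 else "MISMATCH")
```

If the two counts match, the pair is consistent regardless of what the compressed sizes look like.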

0

Thanks! I was submitting these pairs to NCBI SRA and wanted to make sure this won't cause any problems.

0

As you know, I had this problem last time with the SRA submission, where the two files were asymmetric. I just wanted to submit the compressed files this time. Yes, wc -l reports the same number of lines for both files.

1

Upload from a fast, wired connection to minimize the chance of corruption or interruption during the upload.

8
3.7 years ago

A wild guess... Second-in-pair reads usually have base qualities that drop faster along the read than first-in-pair reads. This makes the quality line of each FASTQ record more variable (i.e. more random and less compressible) in R2 than in R1.
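A toy simulation of that idea, as a rough sketch only. The quality model is invented for illustration (a Gaussian around a linearly falling mean, with `decay` as a made-up knob) and is not how any real sequencer behaves; it just shows that noisier quality strings compress worse at the same length.

```python
import random
import zlib

random.seed(0)
READ_LEN = 100   # assumed read length
N_READS = 2000   # number of simulated quality lines


def quality_line(decay):
    """Simulate one Phred+33 quality string whose scores fall along the read.

    `decay` is a hypothetical parameter: larger means a faster, noisier drop.
    """
    chars = []
    for pos in range(READ_LEN):
        mean = 40 - decay * pos
        spread = 1 + decay * pos / 4
        q = int(random.gauss(mean, spread))
        chars.append(chr(33 + max(2, min(40, q))))  # clamp to Q2..Q40
    return "".join(chars)


def compressed_size(decay):
    """DEFLATE-compressed size in bytes of N_READS simulated quality lines."""
    text = "\n".join(quality_line(decay) for _ in range(N_READS))
    return len(zlib.compress(text.encode("ascii"), 6))


r1_like = compressed_size(0.05)  # slow decay, standing in for R1
r2_like = compressed_size(0.20)  # faster decay, standing in for R2
print("R1-like:", r1_like, "bytes; R2-like:", r2_like, "bytes")
```

Both inputs are exactly the same size uncompressed, yet the faster-decaying set comes out larger after compression.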

7
3.7 years ago

Yes. It's perfectly possible, even if the reads are the same length. One might have sequences that are a little more repetitive, and therefore more compressible. If they have the same number of lines, that's all that matters.

It is of course also possible to run gzip with different compression levels, but you don't seem to have done that in this case.
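To illustrate the point about levels, a small sketch using Python's `gzip` module on made-up FASTQ-like data (reads sampled from a random 2 kb "genome", constant qualities). The data and sizes are purely illustrative. Note that command-line gzip uses level 6 when no level flag is given, so two plain `gzip file` calls use the same setting.

```python
import gzip
import random

random.seed(1)
# Made-up reference so that reads overlap and share repeated substrings.
genome = "".join(random.choice("ACGT") for _ in range(2000))

records = []
for i in range(2000):
    start = random.randrange(len(genome) - 100)
    seq = genome[start:start + 100]
    records.append(f"@read{i}\n{seq}\n+\n{'I' * 100}\n")
data = "".join(records).encode("ascii")

# Same input, three compression levels (1 = fastest, 9 = smallest).
sizes = {level: len(gzip.compress(data, compresslevel=level)) for level in (1, 6, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(data):.1%} of original)")
```

The same bytes give different output sizes depending on the level, which is one more reason not to read anything into compressed file sizes.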

0

"One might have sequences that are a little more repetitive"

Mmm... The difference the OP observes is quite noticeable. If the sequence is the cause, it may indicate a problem, since read 1s and read 2s should both be fairly random with respect to genomic position. See my answer for an alternative explanation. (Unless by "sequence" you also include the quality string, in which case my answer is similar to yours.)
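One way to tell the two explanations apart is to compress the header, sequence and quality lines of each file separately and see which stream accounts for the difference. Below is a minimal sketch of that; `stream_sizes` and its `max_records` cap are hypothetical names, it assumes 4-line FASTQ records, and the demo uses made-up files in which R1 and R2 share sequences but R2 has more varied qualities.

```python
import gzip
import os
import random
import tempfile
import zlib


def stream_sizes(path, max_records=1_000_000):
    """Compressed size (bytes) of header, sequence and quality lines, each on its own.

    Only the first `max_records` records are read, to keep memory bounded.
    """
    streams = {"header": [], "sequence": [], "quality": []}
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_records:
                break
            kind = ("header", "sequence", None, "quality")[i % 4]
            if kind is not None:  # skip the '+' separator line
                streams[kind].append(line)
    return {k: len(zlib.compress("".join(v).encode("ascii"), 6))
            for k, v in streams.items()}


# Demo on made-up data: identical sequences, different quality alphabets.
rng = random.Random(2)
seqs = ["".join(rng.choice("ACGT") for _ in range(50)) for _ in range(200)]
tmpdir = tempfile.mkdtemp()
report = {}
for name, alphabet in (("R1", "I"), ("R2", "#5AFI")):
    path = os.path.join(tmpdir, f"{name}.fastq.gz")
    with gzip.open(path, "wt") as out:
        for i, seq in enumerate(seqs):
            qual = "".join(rng.choice(alphabet) for _ in seq)
            out.write(f"@read{i}\n{seq}\n+\n{qual}\n")
    report[name] = stream_sizes(path)
    print(name, report[name])
```

On real data, a gap that shows up mainly in the quality stream would point to the quality-decay explanation; a gap in the sequence stream would be worth a closer look.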