Question: Why do compressed R1 and R2 files have different sizes?
MAPK (1.3k, United States) wrote 8 weeks ago:

I have transcriptome data in two files, R1.fastq and R2.fastq, each 10.8 GB. I compressed the pair using gzip R1.fastq and gzip R2.fastq, and now the files are 2.2 GB and 2.4 GB. Is it possible for two compressed files to have different sizes when the uncompressed files are the same size?

Tags: gzip, fastq
modified 8 weeks ago by dariober (9.7k) • written 8 weeks ago by MAPK (1.3k)
File sizes should never be used as a quantitative measure of anything. Count the number of reads in both files if you want to be certain.
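
For anyone who wants to do that count on the compressed files directly, here is a minimal sketch in Python. It assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function names `count_fastq_reads` and `same_read_count` are made up for illustration.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file.

    Assumes every record is exactly 4 lines (header, sequence, '+', quality).
    """
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests truncation.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def same_read_count(r1_path, r2_path):
    """Return True if both files of a pair hold the same number of reads."""
    return count_fastq_reads(r1_path) == count_fastq_reads(r2_path)
```

Something like `same_read_count("R1.fastq.gz", "R2.fastq.gz")` should then return `True` for an intact pair, whatever the two file sizes are.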

written 8 weeks ago by genomax (58k)

Thanks! I was submitting these pairs to NCBI SRA and wanted to make sure this won't cause any problems.

written 8 weeks ago by MAPK (1.3k)

As you know, I had this problem last time with the SRA submission, where the two files were asymmetric. I just wanted to submit the compressed files this time. Yes, wc -l reports the same number of lines for both files.

modified 8 weeks ago • written 8 weeks ago by MAPK (1.3k)
Upload from a fast, wired connection to minimize the chance of corruption or interruption during the upload.

written 8 weeks ago by genomax (58k)
swbarnes2 (4.5k, United States) wrote 8 weeks ago:

Yes. It's perfectly possible, even if the reads are the same length. One might have sequences that are a little more repetitive, and therefore more compressible. If they have the same number of lines, that's all that matters.
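
As a rough illustration of the compressibility point, the sketch below compresses two made-up sequences of identical length, one random and one built from a repeated motif. The data are synthetic and deliberately extreme, not the OP's reads; `compresslevel=6` is passed to match the default level of the gzip command-line tool.

```python
import gzip
import random

rng = random.Random(0)
length = 100_000

# Synthetic data: a random sequence and a repetitive one of the same length.
random_seq = "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")
motif = "".join(rng.choice("ACGT") for _ in range(50))
repetitive_seq = (motif * (length // 50)).encode("ascii")

random_size = len(gzip.compress(random_seq, compresslevel=6))
repetitive_size = len(gzip.compress(repetitive_seq, compresslevel=6))

print("uncompressed bytes:", len(random_seq), len(repetitive_seq))
print("compressed bytes:  ", random_size, repetitive_size)
```

Both inputs are the same size before compression, but the repetitive one comes out far smaller.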

It is of course also possible to run gzip with different levels of compression, but you don't seem to have done that in this case.
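
To see how the level setting alone changes the output size, here is a small sketch using Python's `gzip` module on made-up FASTQ-like records. The record contents are hypothetical, and the levels 1, 6 and 9 correspond to `gzip -1`, the command-line default, and `gzip -9`.

```python
import gzip
import random

rng = random.Random(0)

# Hypothetical FASTQ-like content: 5000 records with random 50-base reads.
records = []
for i in range(5000):
    seq = "".join(rng.choice("ACGT") for _ in range(50))
    records.append(f"@read{i}\n{seq}\n+\n{'I' * 50}\n")
data = "".join(records).encode("ascii")

# Compress the same bytes at three different levels and record the sizes.
sizes = {
    level: len(gzip.compress(data, compresslevel=level))
    for level in (1, 6, 9)
}

print("uncompressed bytes:", len(data))
for level, size in sizes.items():
    print(f"level {level}: {size} bytes")
```

All three outputs decompress to exactly the same data; only the compressed size differs.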

written 8 weeks ago by swbarnes2 (4.5k)

One might have sequences that are a little more repetitive

Hmm... The difference the OP observes is quite noticeable. If the sequence is the cause, it may indicate a problem, since read 1 and read 2 should both be essentially random with respect to genomic position. See my answer below for an alternative explanation. (Unless by "sequence" you also include the quality string, in which case my answer is similar to yours.)

modified 8 weeks ago • written 8 weeks ago by dariober (9.7k)
dariober (9.7k, Glasgow - UK) wrote 8 weeks ago:

A wild guess... Second-in-pair reads usually have base qualities that drop faster along the read than first-in-pair reads do. This makes the quality line of each FASTQ record more variable (i.e. more random and less compressible) in R2 than in R1.
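
A toy simulation of this idea, not a measurement on real data: the sketch below generates Phred+33 quality lines in which the chance of a low-quality base rises along the read, then compresses an "R1-like" and an "R2-like" set of equal size. The function `simulate_quality_lines` and its `decay` parameter are invented for illustration, and the values 0.2 and 0.6 are arbitrary.

```python
import gzip
import random


def simulate_quality_lines(n_reads, read_len, decay, seed):
    """Return n_reads Phred+33 quality strings as newline-joined bytes.

    `decay` is a hypothetical knob: the probability that a base gets a random
    low quality rises linearly from 0 at the first base to `decay` at the last.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_reads):
        quals = []
        for pos in range(read_len):
            p_low = decay * pos / (read_len - 1)
            if rng.random() < p_low:
                q = rng.randint(2, 30)  # degraded base, random quality
            else:
                q = 40  # good base, constant quality
            quals.append(chr(q + 33))
        lines.append("".join(quals))
    return "\n".join(lines).encode("ascii")


# R2-like qualities degrade faster than R1-like ones (arbitrary settings).
r1_like = simulate_quality_lines(2000, 100, decay=0.2, seed=1)
r2_like = simulate_quality_lines(2000, 100, decay=0.6, seed=2)

r1_size = len(gzip.compress(r1_like, compresslevel=6))
r2_size = len(gzip.compress(r2_like, compresslevel=6))

print("uncompressed bytes:", len(r1_like), len(r2_like))
print("compressed bytes:  ", r1_size, r2_size)
```

The two inputs have the same uncompressed size, yet the noisier R2-like set compresses to a larger output, which is the direction of the difference the OP sees.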

written 8 weeks ago by dariober (9.7k)