Question

Can two mate files have different sizes?

0

5.7 years ago

marongiu.luigi ▴ 710

Hello

I wanted to ask if it is possible for two paired files -- from paired end sequencing -- to have a different byte size, for instance:

-rw-------   1 m G300 18759221836 Jun 11 11:26 501N-1_1.fq.gz
-rw-------   1 m G300 19584616095 Jun 11 11:36 501N-1_2.fq.gz

Probably this is fine because the two reads have different lengths due to the chemistry of the reaction. What about two mates -- from mate-pair sequencing?

thank you

sequencing fastq • 1.2k views

ADD COMMENT • link 5.7 years ago by marongiu.luigi ▴ 710

4

File sizes are never a good basis for any conclusion. If you want to check that the same number of reads is present in both files, count the FASTQ headers.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

In that case, there are still differences:

$ wc -l 501N-1_1.fq.gz 
63553348 501N-1_1.fq.gz
$ wc -l 501N-1_2.fq.gz 
62303198 501N-1_2.fq.gz

The numbers should be divided by 4 to give the number of reads, but it is clear that they are different. Thanks anyway to all.

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

3

You need to count the lines of the decompressed file.

$ zcat 501N-1_1.fq.gz  | wc -l

What you are counting now is the number of newline bytes (\n) in the binary ".gz" file!
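To turn the line count into a read count, a small helper along these lines could be used. This is only a sketch: `count_reads` is a made-up name, it assumes the usual four-lines-per-record FASTQ layout (no wrapped sequence lines), and it uses `gzip -dc`, which should behave like `zcat` on most Linux systems.

```shell
# Count the reads in a gzip-compressed FASTQ file.
# Assumption: every record is exactly 4 lines (header, sequence, "+", qualities).
count_reads() {
    # gzip -dc writes the decompressed data to stdout, so wc sees text lines,
    # not the newline bytes that happen to occur in the compressed stream.
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}
```

Calling `count_reads 501N-1_1.fq.gz` and `count_reads 501N-1_2.fq.gz` should print the same number if the two files hold the same number of reads.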

ADD REPLY • link updated 5.7 years ago by Ram 43k • written 5.7 years ago by Santosh Anand 5.7k

0

Oops, I forgot. You are right; with zcat it becomes:

 $ zcat 501N-1_1.fq.gz | wc -l
 933442092
 $ zcat 501N-1_2.fq.gz | wc -l
 933442092

So the file size is not useful. Thanks.

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

0

Have you done anything to these files or are they "original"?

ADD REPLY • link 5.7 years ago by GenoMax 141k

2

File sizes can only be used as red flags when there is a significant difference in size (roughly orders of magnitude), and even then the files need to share a lot of context to be compared. For example, if your mate files were 5G and 12G in size, there's probably something wrong. But if you have 2 BAM files that are 15G and 60G in size, and that's all the information you have on the files, you cannot derive any conclusion from that.

ADD REPLY • link 5.7 years ago by Ram 43k

1

Read 2 is usually bigger than read 1 because its quality is usually lower: the quality strings are more variable, which reduces the compression ratio.

If you want to check whether the PE files are consistent, you can use a tool called pecheck: https://github.com/OpenGene/pecheck
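For a quick check without installing anything, something like the following could compare the read names of the two files in order. This is a rough sketch and not how pecheck works internally: `read_names` and `pairs_match` are hypothetical helpers, and the code assumes four-line FASTQ records whose names may differ only by a trailing `/1` or `/2` or by a comment after the first whitespace.

```shell
# Print the read names of a gzip-compressed FASTQ file, one per line.
# Drops the leading "@", anything after the first whitespace, and a trailing /1 or /2.
read_names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 {
        name = $1
        sub(/^@/, "", name)
        sub(/\/[12]$/, "", name)
        print name
    }'
}

# Exit status 0 only if both files list the same read names in the same order.
pairs_match() {
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    read_names "$1" > "$tmp1"
    read_names "$2" > "$tmp2"
    cmp -s "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return $status
}
```

Usage would be `pairs_match 501N-1_1.fq.gz 501N-1_2.fq.gz && echo consistent`. For files of this size the temporary name lists will themselves be large, so a dedicated tool is likely the better choice for routine use.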

ADD REPLY • link 5.7 years ago by chen ★ 2.5k