4.6 years ago
marongiu.luigi ▴ 670
I wanted to ask whether it is possible for two paired files -- from paired-end sequencing -- to have different byte sizes, for instance:
-rw------- 1 m G300 18759221836 Jun 11 11:26 501N-1_1.fq.gz
-rw------- 1 m G300 19584616095 Jun 11 11:36 501N-1_2.fq.gz
Probably this is fine because the two reads have different lengths due to the chemistry of the reaction. What about two mates -- from mate-pair sequencing?
File sizes are never a good indicator for any conclusion. Count the number of fastq headers if you are looking to see if the same number of reads are present in two files.
In that case, still there are differences:
The numbers should be divided by 4 to give the number of reads, but it is clear that they are different. Thanks anyway to all.
You need to count the lines of the unzipped file.
What you are counting now is the number of newline bytes (\n) in the binary ".gz" file!
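As a rough illustration of counting reads on the decompressed content rather than on the raw ".gz" bytes, here is a minimal Python sketch. The function name count_fastq_reads is made up for this example, and it assumes a standard FASTQ layout of exactly four lines per record (no wrapped sequence lines).

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file.

    Assumes four lines per record (header, sequence, '+', quality).
    """
    # gzip.open in text mode decompresses on the fly, so we count
    # real FASTQ lines, not newline bytes in the compressed stream.
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Running it on both files of a pair, e.g. count_fastq_reads("501N-1_1.fq.gz") and count_fastq_reads("501N-1_2.fq.gz"), should give the same number if the pair is intact.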
Oops, I forgot. You are right; with zcat it becomes:
So the file size is not useful. Thanks.
Have you done anything to these files or are they "original"?
File sizes can only be used as red flags when there is a significant difference in size, ~orders of magnitude, and even then the files need to share a lot of context to be compared. For example, if your mate files were 5G and 12G in size, there's probably something wrong. But if you have 2 BAM files that are 15G and 60G in size, and that's all the information you have on the files, you cannot derive any conclusion from that.
Read 2 is usually bigger than read 1 because its base qualities are usually lower and more variable, which decreases the compression ratio.
If you want to check whether the PE files are consistent, you can use a tool called PE check: https://github.com/OpenGene/pecheck
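For a quick look at what "consistent" could mean here, the following Python sketch compares read names record by record across the two files. It is not how pecheck works internally; the function names are hypothetical, and it assumes four-line records, the same read order in both files, and read names that differ at most by a trailing "/1" or "/2" or by a comment after the first whitespace.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name from every header line of a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each four-line record
                tokens = line[1:].split()  # drop '@', ignore trailing comment
                name = tokens[0] if tokens else ""
                # Assumption: old-style names mark mates with /1 and /2.
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def pairs_consistent(path1, path2):
    """Return True if both files hold the same read names in the same order."""
    # zip_longest pads the shorter file with None, so a length
    # mismatch shows up as a name mismatch.
    for name1, name2 in zip_longest(read_names(path1), read_names(path2)):
        if name1 != name2:
            return False
    return True
```

A call such as pairs_consistent("501N-1_1.fq.gz", "501N-1_2.fq.gz") returning False would suggest missing or reordered reads in one of the files.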