MD5 checksums of two "identical" Gzipped fastq files different?
3.4 years ago
Dunois ★ 2.5k

I have copies of the same fastq file (let's call it samp1_R1.fastq.gz) in two different locations (e.g., /foo/samp1_R1.fastq.gz and /bar/samp1_R1.fastq.gz). These files are supposed to be identical (the one at /foo is a "working copy" while the one at /bar is a backup).

I happened to generate MD5 checksums from both these files today, and to my surprise, the hashes are different:

$ md5sum /foo/samp1_R1.fastq.gz
13a75f5e319fa772faa85beb04317718 /foo/samp1_R1.fastq.gz

$ md5sum /bar/samp1_R1.fastq.gz
e7a4f3c293361fa569a192f8a26a141d /bar/samp1_R1.fastq.gz

Does this truly mean that the files are now different? I cannot think of anything that could have modified one of these files (the "working copy" was only ever used as an input to standard RNA-seq tools). Could the mere act of uncompressing (and recompressing) the file update the hash like that? (That's the only thing I can recall doing to the file itself.)

Now I'm also wondering whether I should rerun the analyses that were based on this file.

Your inputs would be appreciated.

checksum RNA-Seq fastq integrity • 2.2k views

Thanks for the links, Genomax. With regard to your proposed solution, I'd have to disagree: I was not passing the filename string to md5sum but rather the file's full path. I'd argue the answers at the serverfault link from your comment are more applicable in this case.

But that said, pointing out the possibility of passing a string to md5sum is definitely something to note and to bear in mind. (And will perhaps help someone who inadvertently runs into this in the future.)


Pierre's solution is tackling the gzip possibility.

I have copies of the same fastq file (let's call it samp1_R1.fastq.gz) in two different locations (e.g., /foo/samp1_R1.fastq.gz and /bar/samp1_R1.fastq.gz). These files are supposed to be identical (the one at /foo is a "working copy" while the one at /bar is a backup).

It sounds like you copied the same file to two locations, so presumably the two copies are identical, i.e., no changes were made? If so, there could also be an underlying storage-related issue.

3.4 years ago

Different compression level? Names changed?

check with:

gunzip -c /foo/samp1_R1.fastq.gz | md5sum

gunzip -c /bar/samp1_R1.fastq.gz | md5sum

and/or

paste <(gunzip -c /foo/samp1_R1.fastq.gz) <(gunzip -c /bar/samp1_R1.fastq.gz) | awk -F '\t' '($1!=$2)'
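
The two checks above can also be wrapped into a small reusable test. What follows is only a sketch: the function name same_gz_content and the throwaway demo files are hypothetical, standing in for the real /foo and /bar copies, and it assumes gzip, gunzip, md5sum, and mktemp are available.

```shell
# Hypothetical helper, not from the thread: succeeds only when two gzipped
# files decompress to the same bytes (compared via MD5 of the streams).
same_gz_content() {
    first=$(gunzip -c "$1" | md5sum)
    second=$(gunzip -c "$2" | md5sum)
    [ "$first" = "$second" ]
}

# Throwaway demo files stand in for the real /foo and /bar copies.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmp/reads.fastq"
gzip -1 -c "$tmp/reads.fastq" > "$tmp/working.fastq.gz"
gzip -9 -c "$tmp/reads.fastq" > "$tmp/backup.fastq.gz"

if same_gz_content "$tmp/working.fastq.gz" "$tmp/backup.fastq.gz"; then
    echo "decompressed content is identical"
else
    echo "decompressed content differs"
fi
```

On the real files this would be called as same_gz_content /foo/samp1_R1.fastq.gz /bar/samp1_R1.fastq.gz. Note that it only compares the decompressed data; it says nothing about whether the .gz files themselves are byte-identical.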

Thanks for your quick suggestion, Pierre. It was different compression levels, I think. The hashes are identical for the uncompressed files.

$ gunzip -c /foo/samp1_R1.fastq.gz | md5sum
8166747f939c786f9fe63dda6dd482d6  -

$ gunzip -c /bar/samp1_R1.fastq.gz | md5sum
8166747f939c786f9fe63dda6dd482d6  -

Why would the names being changed affect the hash though? Isn't it calculated based on the contents of the file?


Why would the names being changed affect the hash though?

Yes, my hypothesis was that you could have changed the read names, e.g., by removing the '/1' and '/2' suffixes.


Ah, you mean the fastq headers? I thought you were referring to the filename. I never touched the headers (or any of the other contents of the files, for that matter).
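
On the original question of whether uncompressing and recompressing alone can change the checksum: it can. The compressed bytes depend on the compression level, and the gzip header can also carry the original file name and a timestamp (GNU gzip stores both by default; -n omits them). The sketch below illustrates this on generated stand-in data rather than the real files; the file names are hypothetical, and it assumes gzip, md5sum, seq, and mktemp are available.

```shell
demo=$(mktemp -d)
seq 1 20000 > "$demo/reads.txt"    # stand-in data, not a real fastq

# Same input, two compression levels. -n keeps name and timestamp out of
# the header, so the level is the only thing that differs here.
gzip -n -1 -c "$demo/reads.txt" > "$demo/level1.gz"
gzip -n -9 -c "$demo/reads.txt" > "$demo/level9.gz"

raw1=$(md5sum < "$demo/level1.gz")             # hash of the compressed bytes
raw9=$(md5sum < "$demo/level9.gz")
dec1=$(gunzip -c "$demo/level1.gz" | md5sum)   # hash of the decompressed data
dec9=$(gunzip -c "$demo/level9.gz" | md5sum)
orig=$(md5sum < "$demo/reads.txt")

echo "compressed:   $raw1 / $raw9"
echo "decompressed: $dec1 / $dec9"

# Without -n, the stored file name can differ even when level and content
# match (assumption: GNU gzip defaults; other implementations may behave
# differently).
cp -p "$demo/reads.txt" "$demo/renamed.txt"
gzip -9 -c "$demo/reads.txt"   | md5sum
gzip -9 -c "$demo/renamed.txt" | md5sum
```

So a checksum mismatch between two .gz files does not by itself mean the underlying reads differ; comparing the decompressed streams, as done earlier in the thread, is the more meaningful check.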
