I have copies of the same fastq file (let's call it samp1_R1.fastq.gz) in two different locations (e.g., /foo/samp1_R1.fastq.gz and /bar/samp1_R1.fastq.gz). These files are supposed to be identical (the one at /foo is a "working copy" while the one at /bar is a backup).
I happened to generate MD5 checksums from both these files today, and to my surprise, the hashes are different:
$ md5sum /foo/samp1_R1.fastq.gz
13a75f5e319fa772faa85beb04317718 /foo/samp1_R1.fastq.gz
$ md5sum /bar/samp1_R1.fastq.gz
e7a4f3c293361fa569a192f8a26a141d /bar/samp1_R1.fastq.gz
Does this truly mean that the files are now different? I cannot think of anything that could have modified one of these files (the "working copy" was only ever used as an input to standard RNA-seq tools). Could the mere act of uncompressing (and recompressing) the file update the hash like that? (That's the only thing I can recall doing to the file itself.)
Now I'm also wondering whether I should rerun analyses that were based off of this file.
Your inputs would be appreciated.
Some IT focused solutions to consider:
https://serverfault.com/questions/36966/md5sum-repeatedly-gives-different-checksum-for-same-file-on-same-machine
https://serverfault.com/questions/110208/different-md5sums-for-same-tar-contents and others
I think this is the right solution to your problem: https://unix.stackexchange.com/questions/111645/different-checksum-of-original-file-and-copied-file
Edit: I misinterpreted the solution above, so apologies for that.
Thanks for the links, Genomax. With regard to your proposed solution, I'd have to disagree: I was not passing the filename string to md5sum but rather its full path. I'd argue the answers at the serverfault link from your comment are more applicable in this case.
But that said, pointing out the possibility of passing a string to md5sum is definitely something to note and to bear in mind. (And will perhaps help someone who inadvertently runs into this in the future.)
Pierre's solution is tackling the gzip possibility.
Sounds like you copied the same file to two locations, so presumably it is the identical file, i.e. no changes were made? So there could also be an underlying storage-related issue.
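On the gzip possibility: a gzip file's header can store the original file name and a timestamp, so decompressing and recompressing can change the bytes of the .gz file (and therefore its MD5) without touching the reads inside. Below is a minimal sketch of that effect on a toy file. The temporary directory and the file names toy.fastq, with_header.gz and no_header.gz are made up for illustration, and it assumes GNU gzip and md5sum are available. It does not show what actually happened to the files in the question.

```shell
#!/bin/sh
# Toy demonstration only: all file names here are hypothetical.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# A tiny stand-in for a fastq file.
printf '@read1\nACGT\n+\nIIII\n' > "$tmp/toy.fastq"

# Compress the same content twice: once with the default header
# (original name and timestamp stored), once with -n (both omitted).
gzip -c "$tmp/toy.fastq" > "$tmp/with_header.gz"
gzip -n -c "$tmp/toy.fastq" > "$tmp/no_header.gz"

# MD5 of the compressed bytes (header included).
gz_a=$(md5sum < "$tmp/with_header.gz" | cut -d' ' -f1)
gz_b=$(md5sum < "$tmp/no_header.gz" | cut -d' ' -f1)

# MD5 of the decompressed content only.
raw_a=$(gzip -dc "$tmp/with_header.gz" | md5sum | cut -d' ' -f1)
raw_b=$(gzip -dc "$tmp/no_header.gz" | md5sum | cut -d' ' -f1)

echo "compressed:   $gz_a vs $gz_b"
echo "decompressed: $raw_a vs $raw_b"
```

For the real files, the equivalent check would be to pipe gzip -dc /foo/samp1_R1.fastq.gz and gzip -dc /bar/samp1_R1.fastq.gz into md5sum and compare the two hashes. If the decompressed hashes match, the reads are the same and only the compression wrapper differs. If they do not match, the content itself has changed and a storage or copy problem becomes more likely.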