Question: MD5 Hashing for Many Files
jiwpark00 wrote, 2.8 years ago:

I was wondering: is there a good way to use MD5 hashing to check the many files you download from NCBI (genomes, sequences, etc.)?

By "many", I mean, for example, downloading 1,000 microarray datasets. Is there a good way to use MD5 hashing to verify the integrity of the files?

I know you can write a for loop for this, but I wasn't sure if there was some cleverer way...

rna-seq md5 • 1.5k views
written 2.8 years ago by jiwpark00
2

GNU parallel is a good friend for a bioinformatician. You can fork multiple jobs very easily:

parallel "md5sum {}" ::: *

Even on a desktop with multiple cores you can save a lot of time, and the syntax is convenient and avoids explicit looping.
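A minimal sketch of how this could be combined with verification afterwards (filenames here are made up, and GNU parallel must be installed):

```shell
# Hash every file in the current directory, 4 jobs at a time,
# and collect the results into a single checksum file.
parallel -j 4 "md5sum {}" ::: * > checksums.md5

# Later (or on another machine), verify the files against that list.
md5sum -c checksums.md5
```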

written 2.8 years ago by microfuge

I see GNU parallel recommended many times here, but how does it scale with disk I/O? The OP has 1,000 files, so this has the potential of saturating the I/O if not invoked properly.

modified 2.8 years ago • written 2.8 years ago by genomax

There are options to limit the parallel processing to a certain number of jobs and/or a memory threshold, depending on the limiting factor for your system.
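For example, a sketch using GNU parallel's --jobs and --memfree options (the *.fastq glob and output filename are assumptions for illustration):

```shell
# Run at most 2 md5sum jobs at a time, and only start a new job
# while at least 1 GB of memory is free.
parallel --jobs 2 --memfree 1G "md5sum {}" ::: *.fastq > checksums.md5
```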

written 2.8 years ago by WouterDeCoster

Yes genomax2, it might saturate if invoked improperly, but you can use the -j n option to limit the number of simultaneous jobs to n. I usually use -j 2 or -j 4 on an 8-core system.

written 2.8 years ago by microfuge

The best advice is to try and measure. For details see: https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/

written 2.8 years ago by ole.tange

I apologize for my naivety, but how do I put multiple file directories inside the {}?

In other words:

Like this: parallel "md5sum directory1/file1 directory1/file2" ::: * ? Sorry, I'm not familiar with parallel commands.

written 2.8 years ago by jiwpark00

Another power tool is find, e.g.:

find . -name '*.bam' | parallel -j 3 "md5sum {}"

In this case for BAM files, but you can easily customize this. It's also recursive, so it will search subdirectories for files matching '*.bam' as well. You can check which files it will process by redirecting the output of find to your terminal or to a file: `find . -name '*.bam' > temp.txt`
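If GNU parallel is not available, plain find can do the same work serially (an alternative sketch, not what the reply above used):

```shell
# -exec ... {} + batches many filenames into as few md5sum
# invocations as possible; output format matches md5sum exactly.
find . -name '*.bam' -exec md5sum {} + > checksums.md5
```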

modified 2.8 years ago • written 2.8 years ago by WouterDeCoster
2

You could calculate the md5sum while downloading the file, e.g.:

wget -O - http://www.address.org/file.ext | tee file.ext | md5sum > file.ext.md5

Then you could just compare the computed md5sum with the one available at the site of origin.
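One caveat worth sketching: the md5sum in that pipeline reads from stdin, so the line it records names its input "-" rather than file.ext, and md5sum -c cannot be used on it directly. Comparing just the hash fields avoids that (the .md5 filenames below are made up):

```shell
# file.ext.md5 was produced by the wget | tee | md5sum pipeline, so
# its line looks like "<hash>  -". Compare only the hash field
# against the provider's published checksum (hypothetical file).
local_hash=$(cut -d ' ' -f 1 file.ext.md5)
remote_hash=$(cut -d ' ' -f 1 file.ext.md5.from_site)
if [ "$local_hash" = "$remote_hash" ]; then
    echo "OK: checksums match"
else
    echo "MISMATCH" >&2
fi
```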

written 2.8 years ago by h.mon
1

BTW, this is just playing around, and I have not tested it extensively or on very large files. But if the downloads are done via rsync (not always possible), rsync can be made to report an MD5 sum in the logfile.

rsync  -a --log-file=x --out-format="%C" ../study.table .
eab8552ece4d7ed56082192e81bf0dd1

md5sum ../study.table 
eab8552ece4d7ed56082192e81bf0dd1

The %C format will report MD5 sums for the transferred files in the rsync output. The study.table file was ~60 MB in size. More testing is needed to see whether this works consistently or is of any use.

written 2.8 years ago by microfuge

If you had access to a cluster, you could start those jobs in parallel, but otherwise you are on the right track.

written 2.8 years ago by genomax

Thank you everyone! I appreciate the help greatly!

written 2.7 years ago by jiwpark00