I've been a 'standard programmer' for many years but have recently moved into bioinformatics, and I can see that the types of programs I need to write now are fundamentally different from what I used to write: the huge volume of data and the amount of processing performed mean I need to shift my mindset when I design programs from a 'serial' design to a 'parallel' design.
I've had a look for books on parallel computing and they are WAY too technical for what I'm looking for. I'm looking for tutorials/guidelines, not necessarily specific to bioinformatics, to make me start 'thinking in parallel' if that makes sense.
I was also wondering if there is software available that lets you emulate a multi-processor environment. I was thinking that if I started working in a multi-processor environment, I might start thinking that way.
As a basic example, I have a Perl script I've been passed to work on which sends 12 million SNPs to one function, waits patiently for this function to return, and then shuttles the 12 million SNPs off somewhere else. I'd never heard the term before today, but this is an embarrassingly parallel problem. It's crying out for parallelism (or parallelisation - see, I don't even know the right words!) but I would have implemented the code in exactly the same way myself, as I don't think parallel.
So really, I'm looking for books/tutorials/websites/guides to help me think parallel. I'd also value other people's insights and experiences but I'm aware that this is a vague question and that this forum appreciates questions where you can have a direct answer and not simply discuss things.
Thanks in advance for your help
An important part of any parallel programming is thoughtful domain decomposition. In biology you should look for natural parallelism in the objects themselves. For next-generation sequencing, that means parallel by sample, flowcell, tile, or lane. For other problems it's by gene, chromosome, domain, or conformation.
Once you've decomposed the problem you can choose a technology that enables you to express that parallelism: message passing, map-reduce, or batch processing.
A little searching will get you tons of how-tos talking about the map-reduce paradigm, but I've always found the terminology confusing. It's easier for me to think about it as a three-step process: split, process, and combine.
This is going to be the basic framework for pretty much all of your parallel scripts. Wrap your head around this and you're 90% of the way there.
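To make the three steps concrete, here is a minimal Python sketch. It is not anyone's actual pipeline: the SNP records, the `process` function (counting G alleles) and the worker count are all made up for illustration, and it uses the standard library's `multiprocessing.Pool` as one possible way to run the chunks side by side.

```python
from multiprocessing import Pool


def split(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process(chunk):
    """Hypothetical per-chunk work: count the SNPs whose allele is G."""
    return sum(1 for snp in chunk if snp["allele"] == "G")


def combine(partial_results):
    """Merge the per-chunk counts into a single total."""
    return sum(partial_results)


def run(snps, n_workers=4):
    """Split, process the chunks in separate worker processes, then combine."""
    chunks = split(snps, n_workers)
    with Pool(n_workers) as pool:
        partial = pool.map(process, chunks)
    return combine(partial)


if __name__ == "__main__":
    # Made-up data standing in for a real SNP list.
    snps = [{"id": f"rs{i}", "allele": "ACGT"[i % 4]} for i in range(1000)]
    print(run(snps))
```

The useful property is that `process` only ever sees its own chunk, so the same three functions give the same answer whether the chunks are handled one after another or all at once.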
Some other scattered thoughts:
Don't worry. It takes time to train yourself to see problems in this way. Once you get it, though, you start seeing the patterns everywhere.
If data is relatively small, don't waste time parallelizing your code. As a bioinformatician, your goal is usually to get the data munged quickly, rather than producing perfect and optimized code.
If you've got big data that's easy to split (like your 12M SNPs), think about what level is easiest to split at. You can physically split the data into separate files, then launch the same script on subsets of the data. Alternatively, you can parcel out different threads or processes from within your scripts.
I usually use the first approach. Got an 8-core machine (or a cluster)? Use split to chop the data into smaller chunks, then farm out the processes using a for loop or GNU parallel. Cat the results together and you're in business.
Since you're working with genomic data, processing each chromosome independently is often an easy and natural way to split the data.
Systems like Hadoop are great too, but learning a system like that is probably overkill, at least at first.
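As a rough illustration of the per-chromosome idea mentioned above, the sketch below buckets records by chromosome and hands each bucket to its own worker process. The `(chromosome, position)` record layout and the `summarise` function are assumptions invented for the example; `ProcessPoolExecutor` is from Python's standard library.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def group_by_chromosome(records):
    """Bucket (chromosome, position) records by chromosome."""
    groups = defaultdict(list)
    for chrom, pos in records:
        groups[chrom].append(pos)
    return dict(groups)


def summarise(item):
    """Hypothetical per-chromosome work: record count and position range."""
    chrom, positions = item
    return chrom, {"n": len(positions), "min": min(positions), "max": max(positions)}


def summarise_all(records, max_workers=None):
    """Run summarise() on every chromosome, each in a worker process."""
    groups = group_by_chromosome(records)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(summarise, groups.items()))


if __name__ == "__main__":
    # Made-up records standing in for real variant calls.
    records = [("chr1", 100), ("chr2", 50), ("chr1", 300), ("chrX", 7)]
    print(summarise_all(records))
```

One thing to keep in mind with this kind of split is that chromosomes differ a lot in size, so the largest one tends to set the total running time.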
If you're using R on a single multicore machine, I can recommend the doMC/foreach packages. They're fairly intuitive and work well.
If you're just doing once-off jobs for yourself, I'd echo Istvan's sentiments and suggest you make use of simple techniques where feasible (if that's what he's suggesting!)
e.g. in the SNP example, if you have four processors, maybe you can just break the file into 4 smaller files, each with 3 million SNPs, and kick off four instances of the SNP script, each reading from its own input file and writing to its own output file.
I've rarely dabbled in fully-fledged parallel programming, but I've always found it time-consuming and error-prone. I almost always try to avoid it these days, opting instead to invoke smaller jobs in parallel from the command line.
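For what it's worth, the file-splitting step described above could be sketched in Python roughly like this. The function name, the `.partN` naming scheme and the one-record-per-line, no-header layout are all assumptions made for the example, not a description of any particular SNP format.

```python
from pathlib import Path


def split_file(path, n_parts, out_dir):
    """Write the lines of `path` into at most n_parts files of near-equal size.

    Returns the list of chunk paths. The file is read twice (once to count
    lines, once to write them) so the whole file is never held in memory.
    Assumes one record per line and no header line.
    """
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with path.open() as handle:
        n_lines = sum(1 for _ in handle)
    per_part = max(1, -(-n_lines // n_parts))  # ceiling division

    parts = []
    out = None
    with path.open() as handle:
        for i, line in enumerate(handle):
            if i % per_part == 0:
                # Start a new chunk file every per_part lines.
                if out is not None:
                    out.close()
                part_path = out_dir / f"{path.stem}.part{len(parts)}{path.suffix}"
                parts.append(part_path)
                out = part_path.open("w")
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

Each returned path can then be given to its own instance of the existing script, with the per-chunk outputs concatenated at the end.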
I think you'll find that doing things in parallel requires you to learn a lot of technicalities. I'd advise you to concentrate on the low hanging fruit. To me, these are e.g. simple shell-scripts running sub-processes in parallel:
    # process databases in parallel
    for d in db1.fasta db2.fasta db3.fasta; do
        blastall -i input.fasta -d "$d" -o "$d.tab" -m 8 &
    done
    wait
    # process output here
Or, if you can structure your pipeline in a makefile, you can use 'make -j' to build your targets in parallel. (A makefile uses a declarative language to express which targets (i.e. files) depend on which others, and will recursively build the dependencies of the target you specify. This lets 'make' parallelize a lot of the task automatically.)
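If you'd rather stay inside a script than use the shell, the "start everything in the background, then wait" loop shown earlier can be approximated with Python's standard `subprocess` module. This is only a sketch: `run_in_parallel` is a name invented here, and the commands in the demo are placeholders rather than real BLAST invocations.

```python
import subprocess
import sys


def run_in_parallel(commands):
    """Start every command at once, then wait for all of them to finish.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]  # like `cmd &` in the shell
    return [p.wait() for p in procs]                     # like the shell's `wait`


if __name__ == "__main__":
    # Placeholder commands; swap in the real per-database tool invocation.
    commands = [
        [sys.executable, "-c", f"print('processing db{i}')"] for i in (1, 2, 3)
    ]
    exit_codes = run_in_parallel(commands)
    print(exit_codes)
```

Checking the returned exit codes before processing the output is worth the extra line, since a failed job in the background is easy to miss.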
Writing effective multi-threaded programs is quite difficult, and introduces lots of new ways for your program to fail. For getting good performance, you usually need to spend a lot of time making sure everything is properly balanced.
A good book, that covers a little of every aspect of parallel computing, is "Scalable Parallel Computing" http://www.amazon.com/Scalable-Parallel-Computing-Architecture-Programming/dp/0070317984
The book introduces concepts such as SIMD (single instruction, multiple data), which sounds appropriate for your situation.
While the 2 reviews are from people who did not like the book, for $3-4 what do you have to lose?! :-)