How To Think In Parallel In Bioinformatics
11.7 years ago
Andrea_Bio ★ 2.8k

Hi

I've been a 'standard programmer' for many years but have recently moved into bioinformatics, and I can see that the programs I need to write now are fundamentally different from what I used to write. Because of the huge volume of data and the amount of processing involved, I need to shift my mindset when designing programs from a 'serial' design to a 'parallel' one.

I've had a look for books on parallel computing and they are WAY too technical for what I'm looking for. I'm looking for tutorials/guidelines, not necessarily specific to bioinformatics, to make me start 'thinking in parallel' if that makes sense.

I was also wondering whether there is software that can emulate a multi-processor environment. Perhaps if I started working in a multi-processor environment I would start thinking that way.

As a basic example, I have a Perl script I've been passed to work on which sends 12 million SNPs to one function, waits patiently for this function to return, and then shuttles the 12 million SNPs off somewhere else. I'd never heard the term before today, but this is an embarrassingly parallel problem. It's crying out for parallelism (or parallelisation - see, I don't even know the right words!) but I would have implemented the code in exactly the same way myself, as I don't think in parallel.

So really, I'm looking for books/tutorials/websites/guides to help me think parallel. I'd also value other people's insights and experiences but I'm aware that this is a vague question and that this forum appreciates questions where you can have a direct answer and not simply discuss things.

parallel • 4.9k views

Somewhat tongue in cheek but I found the following a great "guide" http://teddziuba.com/2010/10/taco-bell-programming.html


If you buy a quad-core or 6-core machine to work off of, you won't have to emulate a multi-processor environment in software. That's entry-level hardware these days.


11.7 years ago
Kraut ▴ 230

An important part of any parallel programming is thoughtful domain decomposition. In biology you should look for natural parallelism in the objects themselves. For next-generation sequencing, that means parallel by sample, flowcell, tile, or lane. For other problems it's by gene, chromosome, domain, or conformation.

Once you've decomposed the problem, you can choose a technology that lets you express that parallelism: message passing, MapReduce, or batch processing.


Nicely put - I'd also add the natural parallelism inherent in statistical approaches (resampling, permutation, parametrization, cross-validation)

11.7 years ago

A little searching will get you tons of how-tos talking about the map-reduce paradigm, but I've always found the terminology confusing. It's easier for me to think about it as a three-step process: split, process, and combine.

• Split means figure out how to subset your data into chunks
• Process means run some scripts to do the computation on each chunk in parallel
• Combine means take the results and put them back together.

This is going to be the basic framework for pretty much all of your parallel scripts. Wrap your head around this and you're 90% of the way there.
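
The three steps can be sketched in a few lines of shell. This is only a toy illustration: the file names are made up, the 12-line input stands in for a real data set, and the `awk` doubling is a placeholder for whatever computation each chunk actually needs.

```shell
#!/bin/sh
# Split / process / combine on a toy input (all file names are hypothetical).
work=$(mktemp -d)

# Toy data: 12 lines standing in for millions of records.
seq 1 12 > "$work/snps.txt"

# Split: 3 lines per chunk -> chunk_aa, chunk_ab, chunk_ac, chunk_ad.
split -l 3 "$work/snps.txt" "$work/chunk_"

# Process: run the same placeholder computation on every chunk at once.
for chunk in "$work"/chunk_??; do
    awk '{ print $1 * 2 }' "$chunk" > "$chunk.out" &
done
wait    # block until every background job has finished

# Combine: the glob expands in sorted order, so input order is preserved.
cat "$work"/chunk_??.out > "$work/combined.txt"
```

Only the process step runs in parallel; the split and combine steps stay serial, which is usually fine as long as they are cheap compared with the per-chunk work.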

Some other scattered thoughts:

• Don't worry. It takes time to train yourself to see problems in this way. Once you get it, though, you start seeing the patterns everywhere.

• If data is relatively small, don't waste time parallelizing your code. As a bioinformatician, your goal is usually to get the data munged quickly, rather than producing perfect and optimized code.

• If you've got big data that's easy to split (like your 12M SNPs), think about what level is easiest to split at. You can physically split the data into separate files, then launch the same script on subsets of the data. Alternatively, you can parcel out different threads or processes from within your scripts.

I usually use the first approach. Got an 8-core machine (or a cluster)? Use split to chop the data into smaller chunks, then farm out the processes using a for loop or GNU Parallel. Cat the results together and you're in business.

• Since you're working with genomic data, processing each chromosome independently is often an easy and natural way to split the data.

• Systems like Hadoop are great too, but learning a system like that is probably overkill, at least at first.

• If you're using R on a single multicore machine, I can recommend the doMC/foreach packages. They're fairly intuitive and work well.
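
The per-chromosome split suggested in the list above might look something like the following sketch. It assumes a tab-separated input with the chromosome name in the first column (a hypothetical layout), uses `xargs -P` to cap the number of simultaneous jobs, and counts lines as a stand-in for real per-chromosome work.

```shell
#!/bin/sh
work=$(mktemp -d)

# Toy SNP table: chromosome, position (hypothetical tab-separated format).
printf 'chr1\t100\nchr2\t200\nchr1\t300\nchr3\t400\nchr2\t500\n' > "$work/snps.tsv"

# Split: write each row to a file named after its chromosome (column 1).
awk -F '\t' -v dir="$work" '{ print > (dir "/" $1 ".tsv") }' "$work/snps.tsv"

# Process: run at most 4 jobs at a time; each job counts the rows in one file.
printf '%s\n' "$work"/chr*.tsv |
    xargs -P 4 -I {} sh -c 'wc -l < "$1" | tr -d " " > "$1.count"' sh {}

# Combine: one "chromosome <TAB> count" line per chromosome.
for f in "$work"/chr*.tsv; do
    printf '%s\t%s\n' "$(basename "$f" .tsv)" "$(cat "$f.count")"
done > "$work/summary.tsv"
```

Chromosomes differ a lot in size, so the largest one tends to set the total running time; that is the price of such a simple split.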

11.7 years ago
Bio_X2Y ★ 4.2k

If you're just doing once-off jobs for yourself, I'd echo Istvan's sentiments and suggest you make use of simple techniques where feasible (if that's what he's suggesting!)

e.g. in the SNP example, if you have four processors, maybe you can just break the file into 4 smaller files, each with 3 million SNPs, and kick off four instances of the SNP script, each reading from its own input file and writing to its own output file.

I've rarely dabbled in fully-fledged parallel programming, but I've always found it time-consuming and error-prone. I almost always try to avoid it these days, opting instead to invoke smaller jobs in parallel from the command line.


I agree: when exploiting straightforward data parallelism, it pays to think outside the bun. Not all HPC can be done on one machine and one datastore, though, which might take you to MPI and beyond.

11.7 years ago
Ketil 4.1k

I think you'll find that doing things in parallel requires you to learn a lot of technicalities. I'd advise you to concentrate on the low-hanging fruit. To me, these are e.g. simple shell scripts running sub-processes in parallel:

# process databases in parallel
for d in db1.fasta db2.fasta db3.fasta; do
    blastall -i input.fasta -d "$d" -o "$d.tab" -m 8 &
done
wait
# process output here


Or, if you can structure your pipeline in a makefile, you can use 'make -j' to build your targets in parallel. (A makefile uses a declarative language to express which targets (i.e. files) depend on which others, and will recursively build the dependencies of the target you specify. This lets 'make' parallelize a lot of the task automatically.)
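
As a rough sketch of the makefile approach, the script below writes a tiny makefile and runs it with `-j`. It assumes GNU make (pattern rules with `%` are not available in every make), and the sample file names and the `tr` step are placeholders for real pipeline stages.

```shell
#!/bin/sh
work=$(mktemp -d)

# Three toy inputs standing in for real data files (hypothetical names).
for i in 1 2 3; do echo "acgt$i" > "$work/sample$i.in"; done

# The "target: prerequisites ; recipe" form keeps each rule on a single line.
cat > "$work/Makefile" <<'EOF'
OUTS = sample1.out sample2.out sample3.out

combined.txt: $(OUTS) ; cat $(OUTS) > $@

%.out: %.in ; tr 'acgt' 'ACGT' < $< > $@
EOF

# -j 3 lets make run up to three recipes at once. The three .out files do not
# depend on each other, so they can be built in parallel before combined.txt.
make -C "$work" -j 3 combined.txt
```

Running the same command again does nothing, because make only rebuilds targets that are older than their dependencies; that makes it cheap to resume a pipeline after one step fails.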

Writing effective multi-threaded programs is quite difficult, and introduces lots of new ways for your program to fail. For getting good performance, you usually need to spend a lot of time making sure everything is properly balanced.

7.7 years ago
ole.tange ★ 4.2k

The wonderful thing about bioinformatics is that many of the problems are embarrassingly parallelizable. This makes it possible to reuse your sequential programming skills and let a general parallelizer such as GNU Parallel deal with the parallelization. Even if you wrote a specialized parallel tool, you would often not get any noticeable speed improvement over GNU Parallel for that kind of problem.

GNU Parallel has been used by bioinformaticians for years, and several of the options have been developed with bioinformatics in mind. The hard part is to understand how to use it most efficiently. These examples should get you started:
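
As a minimal sketch of typical usage (an illustration added here, not one of the examples referred to above), the following assumes GNU Parallel is installed as `parallel`; the file names and the `tr` and `wc` commands are placeholders.

```shell
#!/bin/sh
work=$(mktemp -d)
printf 'acgt\nttga\n' > "$work/a.txt"
printf 'ggca\n' > "$work/b.txt"
printf 'catg\ncatg\ncatg\n' > "$work/c.txt"

# {} is replaced by each argument after ':::'; -j 2 caps concurrent jobs at 2.
parallel -j 2 "tr 'acgt' 'ACGT' < {} > {}.out" ::: \
    "$work/a.txt" "$work/b.txt" "$work/c.txt"

# -k (--keep-order) prints results in input order even when jobs finish
# out of order, so the counts line up with a.txt, b.txt, c.txt.
parallel -k -j 2 'wc -l < {}' ::: \
    "$work/a.txt" "$work/b.txt" "$work/c.txt" > "$work/counts.txt"
```

Note that some distributions ship a different `parallel` (from moreutils) with another syntax, so it is worth checking `parallel --version` first.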

11.7 years ago
Burke ▴ 290

A good book that covers a little of every aspect of parallel computing is "Scalable Parallel Computing" (Amazon link).

The book introduces concepts such as SIMD (single instruction, multiple data), which sounds appropriate for your situation.

While the 2 reviews are from people who did not like the book, for $3-4, what do you have to lose?! :-)

11.7 years ago
Austinlew ▴ 310

This is a short tutorial showing how to write a multi-threaded Perl program; I hope you get some ideas from it.