Question: How To Think In Parallel In Bioinformatics
5.3 years ago by
Andrea_Bio2.2k wrote:


I've been a 'standard programmer' for many years but have recently moved into bioinformatics, and I can see that the types of programs I need to write now are fundamentally different from what I used to write. Because of the huge volume of data and the amount of processing performed, I need to shift my mindset when I design programs from a 'serial' design to a 'parallel' design.

I've had a look for books on parallel computing and they are WAY too technical for what I'm looking for. I'm looking for tutorials/guidelines, not necessarily specific to bioinformatics, to make me start 'thinking in parallel' if that makes sense.

I was also wondering if there was software available where you could emulate a multi-processor environment. I was thinking that if I started working in a multi-processor environment, I would start thinking that way.

As a basic example, I have a Perl script I've been passed to work on which sends 12 million SNPs to one function, waits patiently for this function to return and then shuttles the 12 million SNPs off somewhere else. I'd never heard the term before today, but this is an embarrassingly parallel problem. It's crying out for parallelism (or parallelisation - see, I don't even know the right words!) but I would have implemented the code in exactly the same way myself, as I don't think in parallel.

So really, I'm looking for books/tutorials/websites/guides to help me think parallel. I'd also value other people's insights and experiences but I'm aware that this is a vague question and that this forum appreciates questions where you can have a direct answer and not simply discuss things.

Thanks in advance for your help

parallel • 2.3k views
modified 15 months ago by ole.tange2.1k • written 5.3 years ago by Andrea_Bio2.2k

Somewhat tongue in cheek but I found the following a great "guide"

written 5.3 years ago by Istvan Albert ♦♦ 60k

Cross-posted to Quora:

written 5.3 years ago by Mndoci1.2k

If you buy a quad-core or 6-core machine to work off of, you won't have to emulate a multi-processor environment in software. That's entry-level hardware these days.

written 5.3 years ago by David Quigley9.8k

Thank you, everyone, for your answers. It is much appreciated.

written 5.3 years ago by Andrea_Bio2.2k
5.3 years ago by
Pittsburgh, PA
Kraut220 wrote:

An important part of any parallel programming is thoughtful domain decomposition. In biology you should look for natural parallelism in the objects themselves. For next-generation sequencing, that means parallel by sample, flowcell, tile, or lane. For other problems it's by gene, chromosome, domain, or conformation.

Once you've decomposed the problem, you can choose a technology that lets you express that parallelism: message passing, map reduce, or batch processing.

written 5.3 years ago by Kraut220

Couldn't have put it better. Ben Langmead talks about this quite eloquently in his talks on Myrna. The good thing about Hadoop and similar frameworks is that you don't really have to worry about the non-domain aspects of parallelism.

written 5.3 years ago by Mndoci1.2k

Nicely put - I'd also add the natural parallelism inherent in statistical approaches (resampling, permutation, parametrization, cross-validation)

written 5.3 years ago by Hanif Khalak1.2k
5.3 years ago by
Chris Miller15k
Washington University in St. Louis, MO
Chris Miller15k wrote:

A little searching will get you tons of how-tos talking about the map-reduce paradigm, but I've always found the terminology confusing. It's easier for me to think about it as a three step process: split, process, and combine

  • Split means figure out how to subset your data into chunks
  • Process means run some scripts to do the computation on each chunk in parallel
  • Combine means take the results and put them back together.

This is going to be the basic framework for pretty much all of your parallel scripts. Wrap your head around this and you're 90% of the way there.
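As a rough illustration of split / process / combine (not taken from the answer above), here is a minimal shell sketch on toy data. The file names, the chunk size, and the `tr` call standing in for the real per-record script are all assumptions made up for this example.

```shell
#!/bin/sh
# Split / process / combine on a toy input.
# File names and the "processing" step are hypothetical stand-ins.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy input: 12 lines standing in for 12 million SNP records.
seq 1 12 | sed 's/^/snp/' > snps.txt

# Split: 4 lines per chunk, giving chunk_aa, chunk_ab, chunk_ac.
split -l 4 snps.txt chunk_

# Process: one background job per chunk (tr stands in for the real script).
for f in chunk_??; do
    tr 'a-z' 'A-Z' < "$f" > "$f.out" &
done
wait    # block until every background job has finished

# Combine: the glob expands in sorted order, so the input order is kept.
cat chunk_??.out > combined.txt
wc -l < combined.txt
```

Note that a plain `wait` does not report failures of the background jobs, so a real pipeline would want to check each output before combining.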

Some other scattered thoughts:

  • Don't worry. It takes time to train yourself to see problems in this way. Once you get it, though, you start seeing the patterns everywhere.

  • If data is relatively small, don't waste time parallelizing your code. As a bioinformatician, your goal is usually to get the data munged quickly, rather than producing perfect and optimized code.

  • If you've got big data that's easy to split (like your 12M SNPs), think about what level is easiest to split at. You can physically split the data into separate files, then launch the same script on subsets of the data. Alternatively, you can parcel out different threads or processes from within your scripts.

    I usually use the first approach. Got an 8-core machine (or a cluster)? Use split to chop the data into smaller chunks, then farm out the processes using a for loop or GNU Parallel. Cat the results together and you're in business.

  • Since you're working with genomic data, processing each chromosome independently is often an easy and natural way to split the data.

  • Systems like Hadoop are great too, but learning a system like that is probably overkill, at least at first.

  • If you're using R on a single multicore machine, I can recommend the doMC/foreach packages. They're fairly intuitive and work well.
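The per-chromosome split mentioned in the list above could look roughly like the sketch below. It assumes a tab-separated file whose first column is the chromosome name; the file names and the row-counting "analysis" are hypothetical and only there to make the example runnable.

```shell
#!/bin/sh
# Split by chromosome, process each chromosome in parallel, then combine.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy tab-separated input: chromosome, position (hypothetical layout).
printf 'chr1\t100\nchr2\t200\nchr1\t150\nchrX\t50\nchr2\t250\n' > variants.tsv

# Split: awk writes each row to a file named after column 1.
awk -F '\t' '{ print > ($1 ".tsv") }' variants.tsv

# Process: count the rows per chromosome, one background job each.
for f in chr*.tsv; do
    wc -l < "$f" | tr -d ' ' > "${f%.tsv}.count" &
done
wait

# Combine: one "chromosome<TAB>count" line per chromosome.
for f in chr*.count; do
    printf '%s\t%s\n' "${f%.count}" "$(cat "$f")"
done > summary.tsv
cat summary.tsv
```

Chromosomes differ a lot in size, so the largest one tends to set the total run time with this kind of split.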

modified 5.3 years ago • written 5.3 years ago by Chris Miller15k
5.3 years ago by
Bio_X2Y3.2k wrote:

If you're just doing once-off jobs for yourself, I'd echo Istvan's sentiments and suggest you make use of simple techniques where feasible (if that's what he's suggesting!)

e.g. in the SNP example, if you have four processors, maybe you can just break the file into 4 smaller files, each with 3 million SNPs, and kick off four instances of the SNP script, each reading from its own input file and writing to its own output file.
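A minimal sketch of that four-way approach, on toy data. The worker here (an awk one-liner that multiplies each number by 10) is only a stand-in for the SNP script, and all file names are made up for the example. `xargs -P` caps how many copies run at once.

```shell
#!/bin/sh
# Break one input file into at most $jobs parts and run one worker per part.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

seq 1 10 > snps.txt        # toy input, 10 records
jobs=4                     # number of processors to use (assumption)

# Lines per part, rounded up so that at most $jobs files are produced.
total=$(wc -l < snps.txt)
per_part=$(( (total + jobs - 1) / jobs ))
split -l "$per_part" snps.txt part_

# Run at most $jobs workers at once; each reads its own input file and
# writes its own output file.
printf '%s\n' part_?? |
    xargs -P "$jobs" -I {} sh -c 'awk "{ print \$1 * 10 }" "$1" > "$1.out"' sh {}

# Put the per-part results back together in the original order.
cat part_??.out > result.txt
```

Because every worker writes to its own file, there is no shared state to lock, which is most of why this style is so much less error-prone than threads.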

I've rarely dabbled in fully-fledged parallel programming, but I've always found it time-consuming and error-prone. I almost always try to avoid it these days, opting instead to invoke smaller jobs in parallel from the command line.

written 5.3 years ago by Bio_X2Y3.2k

I agree: when exploiting straightforward data parallelism, it pays to think outside the bun. Not all HPC can be done on one machine and one datastore, though, which might take you to MPI and beyond.

written 5.3 years ago by Hanif Khalak1.2k
5.3 years ago by
Ketil3.6k wrote:

I think you'll find that doing things in parallel requires you to learn a lot of technicalities. I'd advise you to concentrate on the low-hanging fruit. To me, these are e.g. simple shell scripts running sub-processes in parallel:

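# process databases in parallel
for d in db1.fasta db2.fasta db3.fasta; do
    # the output file name ("$d.out") is an assumption
    blastall -i input.fasta -d "$d" -o "$d.out" -m 8 &
done
wait  # block until all background jobs have finished
# process output here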

Or, if you can structure your pipeline in a makefile, you can use 'make -j' to build your targets in parallel. (A makefile uses a declarative language to express which targets (i.e. files) depend on which others, and will recursively build the dependencies of the target you specify. This lets 'make' parallelize a lot of the task automatically.)
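To make the 'make -j' idea concrete, here is a small sketch that writes a toy Makefile and builds it with three parallel jobs. It assumes GNU make (pattern rules and `$^`); the file names and the `tr` step are hypothetical placeholders for a real pipeline stage.

```shell
#!/bin/sh
# Let make work out what can run in parallel from the dependency graph.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs (hypothetical names).
printf 'acgt\n' > s1.txt
printf 'ttga\n' > s2.txt
printf 'ccat\n' > s3.txt

# "target: prerequisites ; recipe" keeps each rule on one line, so this
# Makefile needs no literal tab characters.
cat > Makefile <<'EOF'
all: combined.txt
%.out: %.txt ; tr 'a-z' 'A-Z' < $< > $@
combined.txt: s1.out s2.out s3.out ; cat $^ > $@
EOF

# The three .out files do not depend on each other, so make may build
# them at the same time; combined.txt is built only after all three exist.
make -j 3
cat combined.txt
```

A side benefit is that rerunning make after a failure only rebuilds the targets that are missing or out of date.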

Writing effective multi-threaded programs is quite difficult, and introduces lots of new ways for your program to fail. For getting good performance, you usually need to spend a lot of time making sure everything is properly balanced.

written 5.3 years ago by Ketil3.6k
15 months ago by
ole.tange2.1k wrote:

The wonderful thing about bioinformatics is that many of the problems are embarrassingly parallelizable. This makes it possible to reuse your sequential programming skills and let a general parallelizer such as GNU Parallel deal with the parallelization. Even if you wrote a specialized parallel tool, you would often not get any noticeable speed improvement over GNU Parallel for those kinds of problems.

GNU Parallel has been used by bioinformaticians for years, and several of the options have been developed with bioinformatics in mind. The hard part is to understand how to use it most efficiently. These examples should get you started:
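The examples referred to above are not reproduced here. As a separate, minimal sketch (not taken from the answer), this is roughly what running one command per input file with GNU Parallel looks like. The input files and the `tr` step are hypothetical, and the script checks that `parallel` is installed before using it.

```shell
#!/bin/sh
# Run the same command on several files with GNU Parallel.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs standing in for per-sample files (names are hypothetical).
printf 'acgt\n' > s1.txt
printf 'ttga\n' > s2.txt
printf 'ccat\n' > s3.txt

if command -v parallel >/dev/null 2>&1; then
    # {} is replaced by each argument after :::, {.} by the argument
    # without its extension; -j 2 caps the number of simultaneous jobs.
    parallel -j 2 "tr 'a-z' 'A-Z' < {} > {.}.out" ::: s1.txt s2.txt s3.txt
    cat s1.out s2.out s3.out
else
    echo "GNU Parallel is not installed; skipping" >&2
fi
```

The command inside the quotes is ordinary sequential shell, which is the point made above: the serial code stays as it is and the parallelizer handles scheduling.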


modified 15 months ago • written 15 months ago by ole.tange2.1k
5.3 years ago by
Bethesda, MD
Burke250 wrote:

A good book that covers a little of every aspect of parallel computing is "Scalable Parallel Computing".

The book introduces concepts such as SIMD (single instruction, multiple data) that sound appropriate for your situation.

While the 2 reviews are from people who did not like the book, for $3-4 what do you have to lose?! :-)

written 5.3 years ago by Burke250
5.3 years ago by
Austinlew230 wrote:

This is a short tutorial showing how to write a multi-threaded Perl program. I hope you get some ideas from it.

written 5.3 years ago by Austinlew230