multiprocessing versus multithreading
samuel ▴ 240 • 11 months ago

Could someone explain, in very simple terms, the difference between the two to a newbie bioinformatician?

I have read several blogs and am still slightly confused... I have the following understanding (please correct me if I'm wrong):

A) Threads live within a single instance of a process, and multiple threads in a process share the same memory space. An analogy would be: if you had one room to clean, you could split yourself into three and each part cleans a third of the room.

B) Parallelisation involves creating several instances of the same process - each process has its own memory space. An analogy would be: if you had three rooms to clean, you could clone yourself three times and each clone cleans a room.
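
(To make A and B concrete, here is a minimal Python sketch - the dictionary and function are made up for illustration. A thread sees and changes the parent's memory, while a child process only changes its own copy:)

```python
import threading
import multiprocessing

shared = {"counter": 0}

def bump():
    shared["counter"] += 1

if __name__ == "__main__":
    # A thread runs inside the same process and sees the same memory:
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print(shared["counter"])   # 1 -> the thread changed our dict

    # A child process gets its own copy of the memory space:
    p = multiprocessing.Process(target=bump)
    p.start(); p.join()
    print(shared["counter"])   # still 1 -> the child's change stayed in its copy
```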

Let's say you have multiple cores on a machine. If I have several samples to run through several steps like trimming, alignment and variant calling, and each software has the option for threading, I could write a simple bash script to loop through each sample, using say -t 20 for each tool to speed things up.

Would the only way to parallelise things be to use something like Snakemake?

I then read about 'race conditions' here: Multithreading vs Multiprocessing, where multiple threads try to change the same variable simultaneously. Am I risking this by using -t 20 in my simple bash script?

parallel processing multithreading • 1.1k views

"Parallelisation involves creating several instances of the same process"

Not always. It can also mean sending jobs to the different nodes of a cluster.

Could you explain what that means, please, maybe in the context of my examples?

11 months ago

Well, those are not exactly distinct categories: using multiple threads can be a way of achieving parallelization / concurrency (these terms are colloquially used interchangeably). Also, a "process" can refer to a system process in the operating system, or to processes in e.g. Nextflow or Erlang.

Concurrent computation is a topic where one can easily go down a rabbit hole, so I will only scratch the surface here and particularly focus on the data aspect of it, since this is the most relevant for a bioinformatician.

Let's assume you have done an RNA-seq experiment on 10 samples. You could now process the data serially, that is, one sample at a time, until you are finished. However, this would take quite a long time, and so you are looking to parallelize the computation. This can be done safely, because the computations on the individual samples do not affect each other, so if you have enough hardware resources available, things can be sped up.
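
A minimal sketch of this sample-level parallelism in Python (process_sample is a hypothetical placeholder for your real per-sample pipeline):

```python
from concurrent.futures import ProcessPoolExecutor

def process_sample(sample: str) -> str:
    # Placeholder for the real per-sample work (trim, align, quantify, ...)
    return f"{sample}: done"

samples = [f"sample_{i}" for i in range(1, 11)]

if __name__ == "__main__":
    # Each sample is handed to its own worker process; with enough cores,
    # several samples are processed at the same time instead of one after another.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(process_sample, samples):
            print(result)
```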

Yet, for each sample, the compute steps are still executed serially: the files are first decompressed, then aligned, then quantified, etc. But upon closer inspection, you might realize that e.g. a tool doing quality control just reads the alignment, and so does the tool doing the quantification. So once the alignment has been created, you could run both quality control and quantification at the same time for the same sample. In addition to parallelizing over samples, you can thus also parallelize compute steps during the analysis of the same sample to get your results even faster.

This is the level of parallelization that Snakemake and Nextflow provide - they parallelize over samples and run compute steps concurrently when those are independent of one another. Workflow managers are excellent at this task, so I recommend using them over any self-written bash script - unless you want to learn. In that case, you can use named, buffered pipes (FIFOs, created with mkfifo) to chain suitable programs; see also the sketch below.
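
This is roughly what a workflow manager does once its dependency graph says two steps only read the finished alignment. A hedged sketch (qc_tool and quant_tool are placeholder commands, not real programs):

```python
import subprocess

# Both steps only read the finished alignment, so they can run side by side.
qc = subprocess.Popen(["qc_tool", "aligned.bam"])
quant = subprocess.Popen(["quant_tool", "aligned.bam"])

# Start both first, then wait for both - total wall time is roughly
# max(qc, quant) instead of qc + quant.
for step in (qc, quant):
    step.wait()
```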

But let's go a step further and have a look at single reads. Alignment is a computationally quite expensive operation, but the records in a FastQ file (cDNA fragment reads) are independent of each other. So while the aligner is searching for the correct position of one DNA fragment, it could concurrently search for the positions of other fragments as well. This is a prime example of where multi-threading is used: the aligner employs several threads (running on several CPU cores) to compute the alignments of multiple fragments in parallel, and the only tricky part is writing the output file, because here the results from several threads need to be joined into one output stream. However, there are good solutions to this problem, and writing data to a file is usually much faster than the alignment itself, so the results from several threads can be written without issues. When inspecting the output file, you will notice that the reads are no longer in exactly the order of the input, because aligners write each result as soon as it is available, without "waiting" for an earlier read that takes longer to align. Typically, the output will be sorted later anyway, so preserving the input order is dispensable relative to the performance gains.
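
A common pattern behind this (a sketch of the general idea, not how any particular aligner is implemented): several worker threads pull reads from a shared queue, and a single consumer joins all results into one output stream, so no two threads ever write at the same time:

```python
import threading
import queue

reads = [f"read_{i}" for i in range(100)]   # stand-ins for FastQ records
work: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()

def align(read: str) -> str:
    return f"{read} -> chr1:12345"          # placeholder for the real alignment

def worker() -> None:
    while (read := work.get()) is not None:
        results.put(align(read))
    results.put(None)                       # signal that this worker is done

n_threads = 4
threads = [threading.Thread(target=worker) for _ in range(n_threads)]
for t in threads:
    t.start()
for read in reads:
    work.put(read)
for _ in threads:
    work.put(None)                          # one stop signal per worker

# Single writer: results come out in completion order, not input order,
# just like aligner output before sorting.
done = 0
while done < n_threads:
    item = results.get()
    if item is None:
        done += 1
    else:
        print(item)
```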

To sum up, as long as there is a distinct data entity (sample, record, read, gene, chromosome, file) that allows a "piece of data" to be assigned exclusively to a particular operation, it is easy to parallelize said operation over those isolated entities. There is also no real problem if operations need to share a piece of data read-only - the worst that can happen is that this piece of data is kept in memory under the assumption that another operation might need it in the future, while it could already have been deleted.

A data race occurs when the computation shares mutable data, i.e. data that can and will be altered. At best, this leads to two threads/processes/operations having different versions of the data. At worst, simultaneous updates interfere with each other in deleterious ways, resulting in the data being overwritten with garbage.
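
Here is a minimal Python sketch of such a race: four threads increment a shared counter without any synchronization. The increment is a read-modify-write in several steps, so one thread can overwrite another's update (whether you actually observe lost updates on a given run can depend on the interpreter, but the operation is not atomic):

```python
import threading

counter = 0

def add_many(n: int) -> None:
    global counter
    for _ in range(n):
        counter += 1   # read, add, write back: another thread can interleave

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # expected 400000; lost updates can make it smaller
```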

The standard way to avoid these problems is to use locks, which prevent updates of a shared piece of data from happening at the same time by making it temporarily exclusive. However, this is very difficult to get right, so it causes lots of bugs (deadlocks etc.) and also performance hits. Languages like Pony or Rust therefore aim to make locks unnecessary, using more complex types that allow fine-grained control over mutable data (see the Pony reference capabilities or a nice tutorial video about Rust ownership and borrowing).
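
The lock-based fix for the race sketched above looks like this in Python (same hypothetical counter; note that the lock serializes the updates, which is exactly the performance cost mentioned):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:       # only one thread at a time may update the counter
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # reliably 400000 - at the price of serializing every update
```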

But this is really only relevant if you are writing your own software. As long as you are just running other people's code, the operating system and the workflow managers will take care of keeping things isolated - as long as you are not, e.g., specifying the same output file for two tools.

For old hands, "multi-processing" was once upon a time synonymous with SMP (symmetric multi-processing), back when Sun and Silicon Graphics were the kings of that hill.

Writing truly parallel code is difficult, so most NGS software goes the way of brute-force parallelization. Thankfully, because of the nature of the data (millions of independent short reads), it works well.
