Question

Is genome a bottleneck in aligning sequences?

0

Entering edit mode

15 months ago

Geoffrey • 0

I am a wet lab biologist who has dabbled in bioinformatics in the past. I have just started working with WGS data on the scale where I don't want to align individual samples sequentially and instead want to do it in parallel to save time.

I thus call my alignment script (read trimming, BWA, samtools sort, PCR deduplication, etc.) on each sample and launch an independent job using Slurm on my institutions HPC system. It works fine but takes significantly longer than I anticipated. Is this likely to be due to all the samples being aligned against the same genome. I imagine the rate at which alignments can progress is slowed by all the jobs competing to read the same files. If so, would things be sped up by having multiple instances of the same genome and splitting up the samples amongst them?

Slurm HPC WGS sequencing BWA • 759 views

ADD COMMENT • link updated 15 months ago by ATpoint 82k • written 15 months ago by Geoffrey • 0

0

Entering edit mode

I imagine the rate at which alignments can progress is slowed by all the jobs competing to read the same files.

hum... not really. It depends on your number of nodes, on the memory available, on the number of CPUS allocated for each job, the I/O speed, etc...

ADD REPLY • link 15 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

The jobs are all have exactly the same number of nodes, CPUS and amount of memory allocated. All the libraries are similar sizes (+/- 10 million reads). When I run one of the jobs by itself it takes ~5.5 hours. When I launch 10 at once it takes 19-20 hours.

ADD REPLY • link 15 months ago by Geoffrey • 0

score 1 · Answer 1 · 2023-02-11

1

Entering edit mode

15 months ago

Mensur Dlakic ★ 27k

When I run one of the jobs by itself it takes ~5.5 hours. When I launch 10 at once it takes 19-20 hours.

The most likely reason is the number of read/write operations of 10 jobs vs 1. If the data can be read in faster than processors can align them, or close to it, the job runs without much of a bottleneck. If 10 processes are reading from the disk simultaneously, the combined read rate may be greater than the actual disk read rate, which means disk I/O will be slow, and the processor will have to wait for the data.

It is possible that running 3-5 jobs at a time will finish faster than running 10 at once.

ADD COMMENT • link 15 months ago by Mensur Dlakic ★ 27k

1

Entering edit mode

Seconding this, it's almost certainly the overall disk I/O. Not much to do about that. Some HPC nodes have local SSDs attached one could use to temporarily host input data via systems like beeond, but I would probably and simply wait for completion, it's only a day, so what? I would only take action if completion is critical in short time and you do this on a daily basis.

The 'genome' (that is the alignment index) is usually no bottleneck as it gets loaded is to memory.

ADD REPLY • link 15 months ago by ATpoint 82k