Is the genome a bottleneck in aligning sequences?
15 months ago
Geoffrey • 0

I am a wet lab biologist who has dabbled in bioinformatics in the past. I have just started working with WGS data on the scale where I don't want to align individual samples sequentially and instead want to do it in parallel to save time.

I thus call my alignment script (read trimming, BWA, samtools sort, PCR deduplication, etc.) on each sample and launch an independent job using Slurm on my institution's HPC system. It works fine but takes significantly longer than I anticipated. Is this likely to be due to all the samples being aligned against the same genome? I imagine the rate at which alignments can progress is slowed by all the jobs competing to read the same files. If so, would things be sped up by having multiple copies of the same genome and splitting the samples among them?

Slurm HPC WGS sequencing BWA • 759 views

I imagine the rate at which alignments can progress is slowed by all the jobs competing to read the same files.

Hmm, not really. It depends on the number of nodes, the memory available, the number of CPUs allocated to each job, the I/O speed, etc.


The jobs all have exactly the same number of nodes, CPUs, and amount of memory allocated. All the libraries are similar sizes (+/- 10 million reads). When I run one of the jobs by itself it takes ~5.5 hours. When I launch 10 at once it takes 19-20 hours.
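To put those two timings side by side, here is a quick back-of-the-envelope sketch. It assumes every sample would take the ~5.5 hours reported for a solo run, and it uses 19.5 hours as the midpoint of the 19-20 hour range; both are simplifications, not measurements.

```python
# Timings quoted in the thread; the midpoint is an assumption for illustration.
solo_hours = 5.5          # one job running alone
n_jobs = 10               # jobs launched at once
concurrent_hours = 19.5   # assumed midpoint of the reported 19-20 hours

# Time to process all samples one after another, with no contention.
sequential_hours = solo_hours * n_jobs

# How much slower each job runs when all 10 share the system.
slowdown_per_job = concurrent_hours / solo_hours

# How much faster the whole batch still finishes compared with running serially.
speedup_vs_sequential = sequential_hours / concurrent_hours

print(f"sequential total: {sequential_hours:.1f} h")
print(f"per-job slowdown: {slowdown_per_job:.2f}x")
print(f"batch speedup vs sequential: {speedup_vs_sequential:.2f}x")
```

Under these assumptions each job runs roughly 3.5 times slower than it does alone, yet the batch as a whole still finishes in well under the 55 hours a strictly sequential run would need.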

15 months ago
Mensur Dlakic ★ 27k

When I run one of the jobs by itself it takes ~5.5 hours. When I launch 10 at once it takes 19-20 hours.

The most likely reason is the number of read/write operations generated by 10 jobs versus 1. If data can be read at least as fast as the processors can align it, or close to it, the job runs without much of a bottleneck. If 10 processes read from the disk simultaneously, their combined demand may exceed the rate the disk can actually deliver, so disk I/O becomes the limiting step and the processors sit waiting for data.
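The reasoning above can be illustrated with a deliberately crude model. It assumes the shared disk's bandwidth is split evenly among concurrent jobs and that a job can overlap computation with I/O, so its wall time is whichever of the two is longer. The function name and every number below are hypothetical, chosen only to show the shape of the effect, not taken from the poster's system.

```python
def wall_time_hours(n_jobs, compute_hours, data_gb, disk_gb_per_hour):
    """Toy estimate of wall time for n_jobs identical jobs started together.

    Assumes the disk bandwidth is shared evenly and that compute and I/O
    overlap perfectly, so the slower of the two sets the finish time.
    """
    # Each job gets 1/n_jobs of the bandwidth, so its I/O takes n_jobs times longer.
    io_hours = n_jobs * data_gb / disk_gb_per_hour
    return max(compute_hours, io_hours)


# Hypothetical figures: 5.5 h of CPU work and 400 GB of reads plus writes
# per job, against a shared disk that sustains 200 GB per hour in total.
for n in (1, 3, 10):
    print(n, "jobs ->", wall_time_hours(n, 5.5, 400, 200), "h")
```

With these made-up figures a single job is compute-bound, while 10 concurrent jobs become I/O-bound and each takes several times longer. Real storage behaves less ideally, since many concurrent readers can also reduce the total throughput the disk achieves.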

It is possible that running 3-5 jobs at a time will finish faster than running 10 at once.
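One way to try running only a few jobs at a time is a Slurm job array with a concurrency limit, written as `--array=<range>%<limit>`. The sketch below only builds the `sbatch` command line. The script name `align_sample.sh` is a placeholder, and it assumes that script picks its sample from the `SLURM_ARRAY_TASK_ID` environment variable that Slurm sets for each array task.

```python
import shlex


def throttled_array_command(n_samples, max_concurrent, script="align_sample.sh"):
    """Build an sbatch command for a job array limited to max_concurrent tasks.

    The script name is a hypothetical placeholder for the per-sample pipeline.
    """
    # Task IDs 0..n_samples-1; '%' caps how many tasks run simultaneously.
    array_spec = f"0-{n_samples - 1}%{max_concurrent}"
    return ["sbatch", f"--array={array_spec}", script]


# Ten samples, at most four running at once.
cmd = throttled_array_command(10, 4)
print(shlex.join(cmd))
```

Trying a couple of limits in the 3-5 range and comparing total wall time would show whether throttling actually helps on a given filesystem.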


Seconding this; it's almost certainly the overall disk I/O. There is not much to do about that. Some HPC nodes have local SSDs attached that one could use to temporarily host input data via systems like BeeOND, but I would probably just wait for completion. It's only a day, so what? I would only take action if fast completion is critical and you do this on a daily basis.

The 'genome' (that is, the alignment index) is usually not a bottleneck, as it gets loaded into memory.
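If staging input onto node-local storage is worth trying, the general pattern is to copy each sample's files once at the start of the job and point the pipeline at the copies. This is a minimal sketch of that pattern using plain file copies rather than BeeOND. The function name is hypothetical, and it assumes `TMPDIR` points at node-local scratch, which varies between sites and should be checked with the cluster's documentation.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_inputs(input_paths, scratch_root=None):
    """Copy input files into a fresh directory on (assumed) node-local scratch.

    Returns the paths of the staged copies, in the same order as the inputs.
    """
    # Assumption: TMPDIR is node-local; fall back to the system temp directory.
    root = scratch_root or os.environ.get("TMPDIR", tempfile.gettempdir())
    stage_dir = Path(tempfile.mkdtemp(prefix="stage_", dir=root))

    staged = []
    for src in input_paths:
        dst = stage_dir / Path(src).name
        shutil.copy2(src, dst)  # one sequential read from the shared filesystem
        staged.append(dst)
    return staged
```

The alignment steps would then read the returned paths instead of the originals, and the final output would be copied back to shared storage before the job ends, since node-local scratch is typically cleared afterwards.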
