I am a wet-lab biologist who has dabbled in bioinformatics in the past. I have just started working with WGS data at a scale where I no longer want to align samples sequentially and instead want to align them in parallel to save time.
I therefore call my alignment script (read trimming, BWA, samtools sort, PCR deduplication, etc.) on each sample and launch it as an independent job using Slurm on my institution's HPC system. It works, but takes significantly longer than I anticipated. Is this likely to be because all the samples are being aligned against the same reference genome? I imagine the rate at which the alignments progress is slowed by all the jobs competing to read the same files. If so, would things be sped up by keeping multiple copies of the reference genome and splitting the samples amongst them?
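For reference, a minimal sketch of the kind of per-sample job script I mean (paths, tool choices, and resource values here are illustrative placeholders rather than my exact script; the trimmer and deduplication tool in particular are just stand-ins):

```bash
#!/bin/bash
#SBATCH --job-name=align_sample
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# One sample per job; the sample prefix is passed as the first argument.
SAMPLE=$1
REF=/shared/ref/genome.fa   # same reference (and BWA index files) read by every concurrent job

# Read trimming (fastp shown as a placeholder)
fastp -i ${SAMPLE}_R1.fastq.gz -I ${SAMPLE}_R2.fastq.gz \
      -o ${SAMPLE}_R1.trim.fastq.gz -O ${SAMPLE}_R2.trim.fastq.gz

# Alignment piped straight into coordinate sorting
bwa mem -t ${SLURM_CPUS_PER_TASK} ${REF} \
    ${SAMPLE}_R1.trim.fastq.gz ${SAMPLE}_R2.trim.fastq.gz \
  | samtools sort -@ ${SLURM_CPUS_PER_TASK} -o ${SAMPLE}.sorted.bam -

# PCR duplicate marking (Picard shown as a placeholder)
picard MarkDuplicates I=${SAMPLE}.sorted.bam O=${SAMPLE}.dedup.bam M=${SAMPLE}.dup_metrics.txt
```

Each sample is then submitted as its own job with something like `sbatch align_sample.sh <sample_prefix>` in a loop over the samples.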
Hmm... not really. It depends on the number of nodes, the memory available, the number of CPUs allocated to each job, the I/O speed, etc.
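As a rough way to see where the time is going, you can compare the CPU time a finished job actually used against its wall-clock time via Slurm accounting (assuming accounting is enabled on your cluster; `seff` is a contributed script and may not be installed everywhere):

```bash
# CPU time used vs. wall-clock time for a finished job.
# If TotalCPU is much lower than Elapsed x AllocCPUS, the job spent most of
# its time waiting (often on I/O) rather than computing.
sacct -j <jobid> --format=JobID,JobName,Elapsed,TotalCPU,AllocCPUS,MaxRSS

# Per-job CPU and memory efficiency summary, if seff is available
seff <jobid>
```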
The jobs all have exactly the same number of nodes and CPUs, and the same amount of memory allocated. All the libraries are of similar size (+/- 10 million reads). When I run one of the jobs by itself it takes ~5.5 hours; when I launch 10 at once they take 19-20 hours.