Performance Bottlenecks In Next Generation Sequence Analysis Pipeline
10.2 years ago
Daniel Chubb ▴ 40

Hi,

Does anyone who has used the GATK pipeline and/or sequence alignment tools such as BWA and Stampy have a feel for what the worst performance bottlenecks are? As I only have access to the one system that I am running it on (an SGI ICE cluster with network-attached storage), it is hard to get a feel for what might improve it.

I/O seems to be a big issue in most of the processes I run, and I wonder whether running on a machine with fewer cores but fast direct-attached (RAIDed/striped) storage would be faster.

Anyway, it's a bit of an unfocussed question. I just wonder whether anyone has real-world experience of running these processes in different scenarios, and what they have found works best.

Thanks a lot

Dan

gatk alignment next-gen sequencing pipeline • 2.5k views

Is the IO sequential or random? Also what kind of interconnect do you have to the NAS? Presumably it's using NFS or something similar?
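One rough way to answer the sequential-versus-random question is to read the same file both ways and compare throughput. The sketch below is only an illustration: `read_throughput` is a hypothetical helper, the 1 MiB block size is an arbitrary assumption, and the numbers are only meaningful if the test file is larger than RAM (or the page cache has been dropped), otherwise you are timing memory rather than the NAS.

```python
import os
import random
import time


def read_throughput(path, block_size=1 << 20, random_order=False, seed=0):
    """Read every block of `path` once; return (bytes_read, MB_per_second).

    With random_order=True the same blocks are visited in shuffled order,
    which approximates a random access pattern over the file.
    """
    size = os.path.getsize(path)
    offsets = list(range(0, size, block_size))
    if random_order:
        random.Random(seed).shuffle(offsets)

    bytes_read = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own read-ahead buffer; the OS page cache
    # is still in play, so use a file larger than RAM for real measurements.
    with open(path, "rb", buffering=0) as handle:
        for offset in offsets:
            handle.seek(offset)
            bytes_read += len(handle.read(block_size))
    elapsed = max(time.perf_counter() - start, 1e-9)
    return bytes_read, bytes_read / (1024 * 1024) / elapsed
```

Calling `read_throughput(path)` and `read_throughput(path, random_order=True)` on a large BAM or FASTQ file on the NAS gives two figures to compare. If the random figure collapses relative to the sequential one, seek latency rather than raw bandwidth is probably the limit.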

10.2 years ago
Mdeng ▴ 510

Hi Dan,

You are right, I/O is the real bottleneck. At first we used ordinary SAS/SATA storage systems with RAID 5. I have now installed two additional storage systems:

2x 200 GB SAS SSD in RAID 1 for reads, mounted at /tmp/read

2x 200 GB SAS SSD in RAID 0 for writes, mounted at /tmp/write

For even more performance, you could attach each of these arrays to a separate controller. I am running them on a PERC H700 (from Dell). Putting input and output on different storage systems gives a noticeable speed-up. You could spend even more money on high-speed storage, but for our specific systems this config works really well.
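As a minimal sketch of how a pipeline step could use such a split layout: stage the input on the read volume, let the tool write to the write volume, then move the result to its final location. The mount points /tmp/read and /tmp/write are the ones from this post; `run_with_split_scratch` and `build_command` are hypothetical names, and the ".out" suffix is an arbitrary assumption.

```python
import shutil
import subprocess
from pathlib import Path


def run_with_split_scratch(input_path, final_dir, build_command,
                           read_dir="/tmp/read", write_dir="/tmp/write"):
    """Run one step with input and output on separate storage volumes.

    `build_command(staged_input, scratch_output)` is a caller-supplied
    (hypothetical) function returning the argument list for the real tool.
    """
    input_path = Path(input_path)
    read_dir, write_dir, final_dir = Path(read_dir), Path(write_dir), Path(final_dir)
    for directory in (read_dir, write_dir, final_dir):
        directory.mkdir(parents=True, exist_ok=True)

    # Copy the input onto the read-optimised volume.
    staged = read_dir / input_path.name
    shutil.copy2(input_path, staged)
    scratch_out = write_dir / (input_path.name + ".out")
    try:
        # The tool reads from one volume and writes to the other.
        subprocess.run(build_command(staged, scratch_out), check=True)
        final_out = final_dir / scratch_out.name
        shutil.move(str(scratch_out), str(final_out))
    finally:
        # Free the read scratch space whether or not the step succeeded.
        staged.unlink(missing_ok=True)  # requires Python 3.8+
    return final_out
```

Whether the extra copy onto the read volume pays off depends on how many times the step reads its input; for a single sequential pass it may cost more than it saves.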

with best,

Mario


Using RAID 5 and mounting /tmp on a ramdisk?


What do you mean by mounting /tmp on a ramdisk? I am just using my RAID 1 and RAID 0 arrays for I/O.


Why RAID 5? It isn't good for write performance. I've begun some testing on a 12-disk system with RAID 1+0.
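The usual back-of-the-envelope argument behind this is the write penalty: a small random write on RAID 5 costs four disk operations (read data, read parity, write data, write parity), while on RAID 1 or 1+0 it costs two. A rough sketch of that arithmetic follows. The 150 IOPS per disk is a made-up assumption, not a measurement from this thread, and the model ignores controller caches and full-stripe sequential writes, where RAID 5 fares much better.

```python
# Textbook write penalties: disk operations needed per small random write.
WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def random_write_iops(level, disks, iops_per_disk):
    """Estimate the array's small-random-write IOPS from the write penalty."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


# 12 disks, assuming (hypothetically) 150 IOPS per disk.
for level in ("raid5", "raid10"):
    print(level, random_write_iops(level, disks=12, iops_per_disk=150))
# raid5 450.0
# raid10 900.0
```

Under these assumptions the same 12 disks sustain about twice the small-write rate in RAID 1+0, at the cost of usable capacity.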

10.2 years ago
User 59 13k

The biggest improvements we've seen so far have been the ones that increase I/O performance. GATK's built-in parallelism has given us some gains, but it needs to be benchmarked reasonably carefully, as more cores do not mean faster performance in every situation.
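A minimal sketch of that kind of benchmark: time the same step at several thread counts and look at the speedup and parallel efficiency relative to the first count. `build_command` is a hypothetical caller-supplied function that returns the real command line for a given thread count; no particular GATK option is assumed here.

```python
import statistics
import subprocess
import time


def time_command(command, repeats=3):
    """Return the median wall-clock seconds over `repeats` runs of `command`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def benchmark_threads(build_command, thread_counts, repeats=3):
    """Map each thread count to (seconds, speedup, efficiency).

    Speedup and efficiency are relative to the first entry in
    `thread_counts`, which serves as the baseline.
    """
    results = {}
    baseline = None
    for threads in thread_counts:
        seconds = time_command(build_command(threads), repeats)
        if baseline is None:
            baseline = (threads, seconds)
        speedup = baseline[1] / seconds
        # Efficiency 1.0 means perfect scaling; values well below 1.0
        # suggest something other than CPU (often I/O) is the limit.
        efficiency = speedup / (threads / baseline[0])
        results[threads] = (seconds, speedup, efficiency)
    return results
```

If efficiency drops sharply after a few threads, adding cores is unlikely to help and the storage is the more promising place to look.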
