Disk I/O Bound Genome Analysis Application
12.6 years ago
User 2724 ▴ 30

My institute just invested in a new cluster equipped with a Lustre file system. We would like to test disk I/O performance with some popular bioinformatics applications, such as genome alignment/mapping, file format conversion (SAM to BAM), SNP calling, etc. But I am a computer science person without much knowledge of bioinformatics applications. Could someone suggest some genome analysis applications that produce a lot of disk I/O in a short time? Thanks a lot.

12.6 years ago

Assuming that you have a batch system (SGE, PBS, Torque, etc.), simply submitting a bunch of read/write jobs, such as SAM-to-BAM conversion, is probably a useful test. On our cluster, concurrent writes are the major performance bottleneck.
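As a rough sketch of such a test — assuming an SGE-style qsub that accepts job scripts on stdin, samtools installed on the execution hosts, and placeholder paths on the Lustre scratch space — one could submit the same conversion many times at once:

```python
#!/usr/bin/env python
"""Submit many concurrent SAM-to-BAM conversions to a batch queue.

Sketch only: the input SAM file, output directory, and job count are
placeholders; samtools must be installed on the execution hosts.
"""
import subprocess
import textwrap

N_JOBS = 64                                  # number of concurrent conversions
SAM_INPUT = "/lustre/scratch/test.sam"       # hypothetical shared input on Lustre
OUT_DIR = "/lustre/scratch/bam_out"          # hypothetical output directory

for i in range(N_JOBS):
    # Every job reads the same SAM file and writes its own BAM copy,
    # so the concurrent-write pressure all lands on the shared filesystem.
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #$ -N sam2bam_{i}
        #$ -cwd
        samtools view -bS {SAM_INPUT} > {OUT_DIR}/out_{i}.bam
    """)
    subprocess.run(["qsub"], input=script, text=True, check=True)
```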


Thanks. File format conversion will be one of our targets.

12.6 years ago
User 59 13k

Is there something wrong with using Bonnie++ or IOzone for this?
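For reference, a typical Bonnie++ run against the Lustre mount might look something like the sketch below (driven from Python only to keep the examples in one language); the test directory, file size, and machine label are placeholders, and the file size should exceed node RAM so the page cache does not mask the filesystem:

```python
#!/usr/bin/env python
"""Run Bonnie++ against a directory on the Lustre mount.

Sketch only: /lustre/scratch/bonnie, 16g, and lustre-node are placeholders.
"""
import subprocess

subprocess.run(
    [
        "bonnie++",
        "-d", "/lustre/scratch/bonnie",  # directory on the filesystem under test
        "-s", "16g",                     # total file size; should exceed node RAM
        "-n", "0",                       # skip the small-file creation tests
        "-m", "lustre-node",             # label that appears in the result line
    ],
    check=True,
)
```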


I would second this suggestion. It is also a much more complete benchmark than just trying some I/O-heavy bioinformatics application.


Thanks a lot for your suggestion. The reason I am asking for help here is that our cluster is meant to provide services to our biology department, so we would like to demonstrate its capabilities to the biologists in friendlier, easier-to-understand terms. We think that using a popular application as an example is the best way to show them the difference. Thanks again.


I wouldn't necessarily worry about demonstrating how great your disk IO is to your biologists. They're probably far more concerned about how fast you can return their results, not how fast you can write to the filesystem.


You usually want standard benchmarks like SPEC, Bonnie++ and the like, but it's always good to run a synthetic, almost real-world test to see if the hardware supports what you really do day in, day out.

12.6 years ago

I had to write a similar benchmark.

I wrote a random FASTQ generator that follows the Illumina HiSeq error rate (kind of). Then I ran the first step of any pipeline: adapter clipping plus read quality trimming.
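Not the poster's generator, but a minimal sketch of the idea — random reads whose quality scores decay toward the 3' end, enough to produce a large FASTQ file to feed the trimming step. Read length, read count, and output name are arbitrary:

```python
#!/usr/bin/env python
"""Generate a large synthetic FASTQ file for I/O stress testing."""
import random

READ_LEN = 100       # typical HiSeq read length
N_READS = 1000000    # increase until the output is large enough to stress I/O

def random_read(i):
    seq = "".join(random.choice("ACGT") for _ in range(READ_LEN))
    # Phred+33 qualities drift downward along the read, crudely mimicking
    # the per-cycle quality drop of an Illumina run.
    quals = "".join(
        chr(33 + max(2, 40 - pos // 3 - random.randint(0, 5)))
        for pos in range(READ_LEN)
    )
    return f"@synthetic_read_{i}\n{seq}\n+\n{quals}\n"

with open("synthetic.fastq", "w") as out:
    for i in range(N_READS):
        out.write(random_read(i))
```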

It seems to do the job of stressing I/O pretty nicely. The rest of the pipeline (alignment, realignment, SNP calling, etc.) is mostly CPU intensive, with less I/O (roughly 80/20).

The trimming applications were fastx_clipper and fastx_quality_trimmer, with EMBOSS for quality conversion and gzip for obvious reasons.
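A sketch of what that first step could look like when driven from Python, assuming the FASTX-Toolkit binaries and gzip are on the PATH; the adapter sequence, quality cutoff, and file names are placeholders rather than the poster's actual settings:

```python
#!/usr/bin/env python
"""Chain the trimming step on the synthetic FASTQ (sketch only)."""
import subprocess

ADAPTER = "AGATCGGAAGAGC"   # placeholder Illumina adapter prefix

# Clip adapter sequences (Phred+33 input).
subprocess.run(
    ["fastx_clipper", "-Q33", "-a", ADAPTER,
     "-i", "synthetic.fastq", "-o", "clipped.fastq"],
    check=True,
)

# Trim low-quality 3' ends, dropping reads that become too short.
subprocess.run(
    ["fastx_quality_trimmer", "-Q33", "-t", "20", "-l", "30",
     "-i", "clipped.fastq", "-o", "trimmed.fastq"],
    check=True,
)

# Compress the result, as the original pipeline did.
subprocess.run(["gzip", "-f", "trimmed.fastq"], check=True)
```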

The rebalancing of read pairs (sorting paired vs. single reads after these steps) in this pipeline is a custom-made script.
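That custom script isn't shown in the thread, but the usual approach is to intersect the surviving read IDs from the two mate files and split each file into paired and orphan reads. A rough sketch, with hypothetical file names:

```python
#!/usr/bin/env python
"""Re-pair reads after mates were trimmed/filtered independently (sketch)."""

def read_ids(path):
    """Collect read IDs (header line minus '@' and any /1 or /2 suffix)."""
    ids = set()
    with open(path) as fh:
        for lineno, line in enumerate(fh):
            if lineno % 4 == 0:                       # FASTQ header line
                ids.add(line[1:].split()[0].split("/")[0])
    return ids

def split_records(path, keep, paired_out, orphan_out):
    """Write each 4-line FASTQ record to the paired or orphan file."""
    with open(path) as fh, open(paired_out, "w") as p, open(orphan_out, "w") as o:
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:
                break
            rid = record[0][1:].split()[0].split("/")[0]
            (p if rid in keep else o).writelines(record)

shared = read_ids("trimmed_R1.fastq") & read_ids("trimmed_R2.fastq")
split_records("trimmed_R1.fastq", shared, "paired_R1.fastq", "single_R1.fastq")
split_records("trimmed_R2.fastq", shared, "paired_R2.fastq", "single_R2.fastq")
```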


Thanks a lot. I will look into the trimming applications you suggested.
