Bowtie multi-threads and SSD vs spindle disk performance
5.9 years ago
agbiotec • 0

Hello,

Does anyone know of any studies (conference or journal papers, blog posts, or anything else) on bowtie or bowtie2 performance on multi-core systems (6-core Core i7), and on how performance changes with SSDs versus spindle disks? Also, how much memory per core is ideal when running multi-threaded alignment against human genome indexes?

I am looking to build a box for the lab and am trying to figure out two things: how much memory per core I need, and whether SSD drives offer a significant advantage (given that we need terabytes of storage for our sequencing data and SSDs are expensive!).

Also, if SSDs do offer a large speedup for bowtie (say 100x), does the group have any ideas on how to combine them with large spindle storage? I am basically looking for a tower box in the lab. I could get an external spindle disk array and keep the SSD in the box, but I want to avoid staging data back and forth.


What workflow will you be supporting downstream of bowtie2? I've found that the alignment steps are not typically my bottleneck. The answers could differ a lot depending on whether you are ultimately just doing alignments or are mostly doing genotyping/variant discovery, transcriptomics, etc. downstream.

SSDs will only speed up the portions of your workflow that are I/O-intensive. In my experience mappers tend to do a lot in memory before writing out SAM/BAM files, so I'm not sure you would see much of a speedup at the bowtie2 stage.
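
One rough way to check whether a given step is I/O-bound is to compare its wall-clock time with the CPU time it consumed. Here is a minimal Python sketch, assuming a Unix-like system (child CPU times are not reported on Windows). The `profile_command` helper is hypothetical and not part of bowtie2, and the command it times here is just a stand-in:

```python
import os
import subprocess
import sys
import time


def profile_command(cmd):
    """Run cmd and return (wall_seconds, cpu_seconds) for the child process.

    cpu_seconds sums user and system time across all threads of the child,
    so for a multi-threaded aligner it can exceed wall_seconds.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace with the aligner call.
    wall, cpu = profile_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

With N threads, CPU time close to N times the wall time suggests the step is compute-bound and an SSD is unlikely to help much. CPU time well below that suggests the process spends time waiting, possibly on disk. For bowtie2 the command would look something like `["bowtie2", "-p", "6", "-x", "index_prefix", "-U", "reads.fq", "-S", "out.sam"]`, where the index prefix and file names are placeholders.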


Thank you for your reply. Downstream it will be TopHat / Cufflinks (the typical pipeline for differential expression using RNA-seq data).

5.9 years ago
DG 7.2k

Here is a paper on arXiv profiling speed-ups from SSD drives in a variety of bioinformatics workflows, including RNA-Seq: http://arxiv.org/abs/1502.02223

And a post from Brad Chapman: http://bcb.io/2014/12/19/awsbench/. Brad's post isn't specific to SSDs; it is a large-scale benchmark on Amazon using Docker containers. But Amazon's use of SSD storage on the backend for its high-end file systems is one factor in the speeds there, so it might be worth reading. It also gives you some idea of what it costs to rent large clusters and storage from Amazon, which is something to seriously consider.

If you are looking at building a single machine to use in the lab, the first thing to consider is throughput. What size sequencer are you supporting, and is that sequencer in the lab or just one that you have an affiliation with and will be using a lot? If you're supporting a HiSeq in production you need some serious hardware investment, and a single server, no matter how large, just won't do. Even with the smaller machines (a MiSeq, for instance), how many runs you expect to do per month or per year factors into how much storage space you really need.

In general I would recommend as much RAM as you can afford and as many processing cores as you can afford at a reasonable speed. You probably want at least 10-12 GB of RAM per processing core. Definitely Intel chips, the newer V3 specs if possible. Some tools have been compiled with the Intel compilers and can take advantage of the newer AVX instruction set for significant speed-ups.
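
To put the per-core figure into numbers for the 6-core box mentioned in the question, here is a quick back-of-the-envelope sketch. The `ram_range_gb` helper is just an illustration, and the 10-12 GB per core defaults are the rule of thumb above rather than a measured requirement:

```python
def ram_range_gb(cores, gb_per_core_low=10, gb_per_core_high=12):
    """Return (low, high) total RAM in GB for a given core count."""
    return cores * gb_per_core_low, cores * gb_per_core_high


low, high = ram_range_gb(6)
print(f"6 cores: {low}-{high} GB RAM")  # 6 cores: 60-72 GB RAM
```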

You're right that you probably want to stage your data storage in tiers. Use SSDs for active processing and for storing the resource files (reference genome, etc.) that you constantly use read-only. You may even want these separate from one another. Then fill the system with as much regular spinning-disk storage as you can afford, and decide what RAID level or other method of redundancy you want to use.
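
The tiering idea can be scripted so the staging happens automatically. This is a minimal sketch, assuming the SSD and the spinning disk are simply two mounted directories. The `run_on_scratch` helper, the `.out` suffix, and all paths are hypothetical, and the demo "aligner" just copies its input:

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_on_scratch(input_file, bulk_out_dir, scratch_root, make_cmd):
    """Copy input_file to a temporary directory under scratch_root (the SSD),
    run the command built by make_cmd(local_input, local_output), then copy
    the result to bulk_out_dir (spinning disk). The scratch copy is deleted
    when the function returns.
    """
    input_file = Path(input_file)
    bulk_out_dir = Path(bulk_out_dir)
    bulk_out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        local_in = Path(tmp) / input_file.name
        local_out = Path(tmp) / (input_file.stem + ".out")
        shutil.copy2(input_file, local_in)
        subprocess.run(make_cmd(local_in, local_out), check=True)
        final = bulk_out_dir / local_out.name
        shutil.copy2(local_out, final)
    return final


if __name__ == "__main__":
    # Demo with throwaway directories; swap in real mount points and the
    # real aligner command.
    demo = Path(tempfile.mkdtemp())
    (demo / "ssd").mkdir()
    reads = demo / "reads.fq"
    reads.write_text("@r1\nACGT\n+\nIIII\n")

    def copy_cmd(src, dst):
        code = "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])"
        return [sys.executable, "-c", code, str(src), str(dst)]

    print(run_on_scratch(reads, demo / "bulk", demo / "ssd", copy_cmd))
```

Read-only resources such as the reference index would live on the SSD permanently, so only the per-sample reads and outputs move between tiers.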

If you want to look at getting boat-loads of storage at a low price point, I'd recommend a company called 45Drives: http://www.45drives.com/. 45Drives is the company that was working with the online storage company Backblaze to build ultra-dense storage solutions. They are becoming quite widely known in the genomics community as well, and a number of sequencing centres are using their gear.

Basically you can fit 45 drives (commodity drives you buy yourself, or they also resell Western Digital datacentre drives) into a single chassis. With expensive 8TB drives that's 360TB of raw storage. With more reasonably priced 4TB drives you can get up to 180TB. They don't currently support SSDs though, except I think as the two OS drives you can have in the system in addition to the storage array.

Full disclosure: I'm also a recent customer of theirs, but I have no other incentive for recommending them. They do custom gear and builds as well, so you could design a pretty hard-core computing server with massive amounts of storage in one box, or you could buy one off the rack as a very high-density NAS server connected to your processing server. I'm sticking three of their units into a cluster with other compute nodes myself.
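
For a sense of what redundancy does to those raw numbers, here is a small sketch. The raw figures match the ones above; the usable figure assumes, purely for illustration, three 15-drive groups that each give up two drives' worth of capacity to parity (as RAID 6 does), and it ignores filesystem overhead. Both helper functions are hypothetical:

```python
def raw_capacity_tb(bays, drive_tb):
    """Raw capacity in TB with every bay filled."""
    return bays * drive_tb


def usable_capacity_tb(bays, drive_tb, group_size, parity_per_group):
    """Capacity left after parity, with bays split into equal groups.

    Bays that do not fill a whole group are left out of the estimate.
    """
    groups = bays // group_size
    return groups * (group_size - parity_per_group) * drive_tb


for drive_tb in (4, 8):
    raw = raw_capacity_tb(45, drive_tb)
    usable = usable_capacity_tb(45, drive_tb, group_size=15, parity_per_group=2)
    print(f"{drive_tb} TB drives: {raw} TB raw, {usable} TB usable")
```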


Dan: Thanks for linking the paper from arXiv.

I have not read the paper in detail, but based on the results in Table 1 it is clear that SSDs don't offer any improvement with NGS aligners.

Something just does not feel right about the choice of hardware/software used in this study. The authors are using 128GB SSD drives, which generally have the lowest performance of the available capacities (because of the number of dies). The only compute hardware described also appears to be fairly low end:

"We used a machine equipped with a 3.3GHz Intel Core i3-3220 CPU (4 threads, 4MB L3 cache), 1600MHz dual-channel DDR3 memory (4GB for the GATK tools and 1GB for the others), and Ubuntu 12.04 LTS (Precise Pangolin)"

I also haven't read it in detail, but I agree their test was on fairly low-end hardware. That said, I would look for tools where you do see performance increases, though that will typically only be tools that write a lot of temp files or do a lot of reading in stages.

My general suggestions were based on personal experience. If you can afford SSDs, they are good to fit into your workflow where you can. For the moment, though, your large storage is going to be standard spinning disk unless you have a ton of money to spend. It just isn't economical to build TBs of SSD storage right now.
