In my experience this is extremely difficult to get right, and almost impossible without access to the specific cluster and billing system. Your billing system looks relatively straightforward, but take a close look at the fine print: do they charge extra for GPU, storage or memory usage? Also, they bill based on wall-clock time, which is unfair but common. For example, if you use your allocated CPUs inefficiently and the jobs spend most of their time idling or in I/O wait, you will still be billed for the full wall-clock time, CPUs and maximum memory (running AlphaFold for 10 hours costs the same as running sleep for the same 10 hours). The run time and memory costs also depend on the size of the datasets. You cannot really extrapolate the run time from another stand-alone PC or cloud, because the components may have widely different performance.
The best approach is to go to that cluster and run realistic test cases for each analysis. Prepare, for example, a representative RNA-seq and ChIP-seq dataset with the same reference genome and sequencing depth as you expect in the real experiments. Install or locate all the required tools and set up your analysis pipeline. Then submit the jobs through the cluster's batch system with sbatch. After the jobs have finished, make sure the results are correct and complete (e.g. that no job was killed for exceeding the wall-clock or memory limit).
Assuming the batch system is SLURM, there is a tool called sacct. Call it with the IDs of the test jobs using this format:
user@uan02:~> sacct -j 123456 --format JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime
       JobID    JobName    Elapsed      NCPUS   TotalCPU     CPUTime
------------ ---------- ---------- ---------- ---------- -----------
123456       af_monomer   01:45:38        896   06:41:03 65-17:27:28
123456.bat+       batch   01:45:38        112   06:41:03  8-05:10:56
Use the last column (CPUTime) and multiply it by the estimated total number of samples you are going to process. That gives you the estimated CPU hours in a perfect world (CPU_pw). Of course it never goes perfectly: jobs run out of time, processes get killed, files are not found, and you are billed anyway. So you need to multiply CPU_pw by a safety factor (SF). I'd estimate you will run each pipeline at least twice, so use an SF of at least 2, and up to 10 if you need a lot of experimentation and optimization to get your pipeline to run or to adapt it to different input data.
Estimated cost = sum over all pipeline types of (CPUTime in hours × number of samples) × SF × £0.01 + any storage cost (and round up to the next £1000 or so)
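To make that formula concrete, here is a minimal Python sketch. It assumes CPUTime is reported as [DD-]HH:MM:SS, as in the sacct output above, and that each test job processed exactly one sample. The pipeline names, CPUTime values, sample counts and safety factor in the example are hypothetical placeholders, not measurements.

```python
import math


def slurm_time_to_hours(value: str) -> float:
    """Convert a Slurm time string of the form [DD-]HH:MM:SS to hours."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return days * 24 + hours + minutes / 60 + seconds / 3600


def estimate_cost(cputime_per_sample, n_samples, safety_factor,
                  price_per_cpu_hour=0.01, storage_cost=0.0):
    """Sum CPU hours over all pipeline types, then apply SF, price and storage."""
    cpu_hours = sum(
        slurm_time_to_hours(cputime_per_sample[name]) * n_samples[name]
        for name in cputime_per_sample
    )
    return cpu_hours * safety_factor * price_per_cpu_hour + storage_cost


# Hypothetical CPUTime values for one test sample per pipeline type.
cputime_per_sample = {"rnaseq": "06:00:00", "chipseq": "1-02:30:00"}
# Hypothetical number of samples per pipeline type.
n_samples = {"rnaseq": 200, "chipseq": 200}

cost = estimate_cost(cputime_per_sample, n_samples, safety_factor=3)
rounded_cost = math.ceil(cost / 1000) * 1000  # round up to the next 1000
print(f"estimated cost: {cost:.2f} GBP, budget for: {rounded_cost} GBP")
```

Replace the placeholder values with the CPUTime of your own test jobs and your real sample counts.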
I am guessing you will land at a few thousand pounds at most, which is still cheaper than buying a machine with suitable specs.
Hope this helps.
Thank you all, that was very helpful and gave me good insights on how to think about it and proceed with it. I realized that my university charges for a total of 48 cores regardless of whether I use 1 core or 48, so that made the total cost per year around £4200 (48 x 24 x 365 = 420,480 core-hours, so at £0.01 per core-hour that is around £4200). I'm not sure if that sounds excessive, but if it does, then I can always halve it and it should sound reasonable, right?
Do you also happen to have any thoughts on storage requirements? My university offers 5TB of storage as standard and then charges £250/TB/year. I am again struggling to work out how much storage space I would need to request for all my experiments. My current plan involves doing a total of around 50 experiments (ChIP-Seq, RNA-Seq, etc.). If I recall correctly, just storing the raw data and any intermediate files generated would take around 300-400 GB per experiment. Does that sound about right? If so, then it'd be 400 GB x 50 experiments = 20,000 GB = 20 TB? So I'd need to purchase an extra 15 TB for the duration of the grant?
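The two estimates in this post can be written down as a short Python sketch so the inputs are easy to change. All figures are the ones quoted above; the function name is just illustrative, and 1 TB is taken as 1000 GB.

```python
def extra_storage_cost(total_tb, included_tb=5.0, price_per_tb_year=250.0):
    """Return (extra TB to purchase, cost per year in GBP) above the free allowance."""
    extra_tb = max(0.0, total_tb - included_tb)
    return extra_tb, extra_tb * price_per_tb_year


# Compute: 48 cores billed around the clock at 0.01 GBP per core-hour.
core_hours_per_year = 48 * 24 * 365
compute_cost_per_year = core_hours_per_year * 0.01

# Storage: 400 GB per experiment, 50 experiments.
total_tb = 400 * 50 / 1000
extra_tb, storage_cost_per_year = extra_storage_cost(total_tb)

print(f"compute: {core_hours_per_year} core-hours, {compute_cost_per_year:.0f} GBP/year")
print(f"storage: {total_tb:.0f} TB total, {extra_tb:.0f} TB extra, "
      f"{storage_cost_per_year:.0f} GBP/year")
```

Multiply the yearly figures by the length of the grant in years to get the total to request.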
That is a very generic specification. How many samples are going to be in each experiment? The sample is the basic unit for estimating compute, storage, etc. How many million reads do you plan to sequence per sample, will they be paired-end, and how many cycles?
4200 pounds sounds perfectly reasonable for this volume of data.
If that is reflective of your current usage (and you are planning to continue using the same strategy) then yes.
Sorry for missing out on these details. They would be 8 samples per experiment, 30 million reads, PE 150.
Thank you for your help.
That's equivalent to a single NovaSeq run (for certain systems). A few terabytes should work for that.
Five terabytes will be enough for storage, but you have to be frugal about keeping intermediate files (e.g. delete your adapter-trimmed FASTQ files once you're done using them, compress all files where possible, etc.). If you have 400 gigabytes per experiment, that means you're keeping a lot of (possibly excessive) intermediate files.
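As a rough cross-check of the "few terabytes" figure, here is a back-of-the-envelope Python sketch for the raw FASTQ volume. The sample counts and read length come from the thread; everything else is an assumption: that "30 million reads" means 30 million read pairs, that uncompressed FASTQ takes about 2 bytes per base (sequence plus quality, ignoring headers), and that gzip shrinks it roughly 4-fold. Adjust these to match your own data.

```python
def fastq_size_tb(n_samples, reads_per_sample, read_length,
                  bytes_per_base=2.0, compression_ratio=4.0):
    """Return (uncompressed TB, compressed TB) of raw FASTQ under the stated assumptions."""
    total_bases = n_samples * reads_per_sample * read_length
    uncompressed_tb = total_bases * bytes_per_base / 1e12
    return uncompressed_tb, uncompressed_tb / compression_ratio


n_samples = 50 * 8                  # 50 experiments x 8 samples
reads_per_sample = 30_000_000 * 2   # ASSUMPTION: 30 million pairs, two reads each
uncompressed_tb, gzipped_tb = fastq_size_tb(n_samples, reads_per_sample, 150)

print(f"raw FASTQ: ~{uncompressed_tb:.1f} TB uncompressed, ~{gzipped_tb:.1f} TB gzipped")
```

Alignments and other intermediate files come on top of this, which is why deleting or compressing them makes such a difference.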
This doesn't sound right, even if these are ultra-fast disks. I bought a 10 TB disk for about the same amount you will be paying for 1 TB/year. Not a very fast disk, but still. Our high performance storage system is $30/TB/year.
This is very similar to the cost of storage at my institution. We are charged £100/TB/copy/year, but they only really do things in 2 copies, so that comes in at £200/TB/year. There is a special deal here where, if you buy 5 years at once, they store the data indefinitely.
The cost of highly parallel cluster storage systems is more or less completely unrelated to the cost of directly attached disks, or even consumer-grade NAS, where each disk platter can generally only be accessed by one user at a time.
Once, I had the same thought as you. When I was applying for a grant to build a bioinformatics-specific HPC, I thought I'd bypass the university and get our own storage, so I got quotes directly from the supplier. You paid £30k for each 75TB disk shelf (giving you 37.5TB of storage when set up with mirroring). You then needed at least one management node and a cache shelf, and finally a service contract. At the time, each of those 75TB shelves was built from 2.5TB disks. That's 30 disks per shelf. If you have (as we were planning) 6 shelves, that's 180 disks (not counting the cache). With that number of disks, failure of at least one disk is quite common. With the service contract, the new disk arrives between one week and one day before the old disk fails, so when the failure happens you can just pull the old one out and pop the new one in without taking the system offline, and it is automatically repopulated from the mirror.
It turned out that the capital cost was almost exactly £1000/TB, which makes £200 per mirrored TB per year a pretty good deal (based on the system lasting 5 years).
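The arithmetic behind that comparison, as a small Python sketch. The shelf price, capacities, disk size and 5-year lifetime are the figures quoted above. The individual prices of the management node, cache shelf and service contract are not given in the post, so the sketch only shows the shelf-only price next to the quoted all-in figure.

```python
shelf_price_gbp = 30_000
raw_tb_per_shelf = 75
disk_size_tb = 2.5
n_shelves = 6
lifetime_years = 5

usable_tb_per_shelf = raw_tb_per_shelf / 2                 # mirroring halves capacity
disks_per_shelf = raw_tb_per_shelf / disk_size_tb
total_disks = disks_per_shelf * n_shelves

shelf_only_per_tb = shelf_price_gbp / usable_tb_per_shelf  # shelves alone, per mirrored TB
all_in_per_tb = 1000                                       # quoted total capital cost per TB
all_in_per_tb_year = all_in_per_tb / lifetime_years

print(f"{disks_per_shelf:.0f} disks per shelf, {total_disks:.0f} disks in total")
print(f"shelves alone: {shelf_only_per_tb:.0f} GBP per mirrored TB")
print(f"all-in: {all_in_per_tb_year:.0f} GBP per TB per year over {lifetime_years} years")
```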
This is copied directly from the website:
If you are doing 50 experiments x 8 samples = 400 samples, I don't think that 48 cores sounds that unreasonable, but yes, you could always ask for it for half the time.
For disk, you could probably fit onto 10TB if you needed to, but why bother trying and risk having to delete a file you might need later? I usually budget 5TB for a project, but my individual projects are (usually) smaller than 400 samples. I don't think that 20TB is particularly excessive, although you could go for 15TB (300GB per experiment).
One way to look at this is you are not trying to decide how to make a particular amount of money go a long way, but rather how much money to ask for to do something properly.
If you are quoting in £s, then the chances are you are submitting to a UKRI grant call. I've never had my computational costs queried on a UKRI grant. Even if you are quite generous, they are going to pale compared to the total costs of the grant. UKRI grants are generally in the £500k-£800k bracket. Even if you ask for 48 cores for the full length of the grant and 10TB, that's still only around £25k. The cost of the sequencing you are talking about would be around £72k alone.