Forum:Determining computer time required
6 months ago
Rozita ▴ 40

Hello,

I am writing a grant and I would like to include the costs required to run programs on the high performance computers. In the institute where I work, they charge £0.01 per core hour which is defined as the equivalent of 1 processing core being used for 1 hour of wall clock time.

I'm planning to do several RNA-Seq and ChIP-Seq experiments (and related assays based on similar principles), and I'm unable to determine how many core hours I would need, as there will be some custom programs I'll use for downstream analysis. But at least for the standard pipeline (trimming, FastQC, mapping, etc.), how can I determine how much time would be required?

Perhaps, from your experience analyzing such experiments, you could tell me roughly how much total compute time they require, and I will then extrapolate and estimate?

Also, are there any alternatives to using the university's HPC system?

Thank you.

computer hpc time

Thank you all, that was very helpful and gave me good insights into how to think about this and proceed. I realized that my university charges for a total of 48 cores regardless of whether I use 1 core or 48, which makes the total cost per year around £4,200 (48 cores × 24 h × 365 days = 420,480 core-hours; at £0.01/hour, that's about £4,200). I'm not sure if that sounds excessive, but if it does, I can always halve it and it should sound reasonable, right?

Do you also happen to have any thoughts on storage requirements? My university offers 5TB of storage as standard and then charges £250/TB/year. I am again struggling to work out how much storage space I would need to request for all my experiments. My current plan involves doing a total of around 50 experiments (ChIP-Seq, RNA-Seq, etc.). If I recall correctly, storing the raw data and any intermediate files generated comes to around 300-400 GB per experiment - does that sound about right? If yes, then it'd be 400 GB × 50 experiments = 20,000 GB = 20 TB, so I'd need to purchase an extra 15 TB for the duration of the grant?
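Spelled out, the arithmetic above looks like this (a minimal Python sketch; the 48-core charge, £0.01/core-hour rate, and £250/TB rate are the figures quoted in this thread, while the 400 GB/experiment figure is only my own estimate):

```python
# Sketch of the cost arithmetic above. Rates are the ones quoted in
# this thread; the 400 GB/experiment figure is an estimate, not a quote.
CORE_HOUR_RATE_GBP = 0.01   # £ per core-hour
STORAGE_RATE_GBP = 250      # £ per TB per year, above the 5 TB allowance
FREE_STORAGE_TB = 5

cores = 48
core_hours_per_year = cores * 24 * 365                    # 420,480 core-hours
compute_cost = core_hours_per_year * CORE_HOUR_RATE_GBP   # ~£4,205/year

experiments, gb_per_experiment = 50, 400
total_tb = experiments * gb_per_experiment / 1000   # 20 TB
extra_tb = max(0, total_tb - FREE_STORAGE_TB)       # 15 TB above the allowance
storage_cost = extra_tb * STORAGE_RATE_GBP          # £3,750/year

print(f"compute ~£{compute_cost:,.0f}/yr, extra storage £{storage_cost:,.0f}/yr")
```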


> My current plan involves doing a total of around 50 experiments

That is a very generic specification. How many samples will be in each experiment? That is the basic unit for estimating compute, storage, etc. How many million reads do you plan to sequence per sample, will they be paired-end, and how many cycles?

4200 pounds sounds perfectly reasonable for this volume of data.

> If I recall correctly, I think just storing the raw data and any intermediate files generated would be around 300-400 GB per experiment - does that sound about right?

If that is reflective of your current usage (and you are planning to continue using the same strategy) then yes.


Sorry for leaving out those details. There would be 8 samples per experiment, 30 million reads per sample, PE 150.


That's equivalent to a single NovaSeq run (for certain systems). A few terabytes should work for that.

Five terabytes will be enough for storage, but you will have to be frugal about intermediate files (e.g. delete your adapter-trimmed FASTQ files after you're done using them; compress all files where possible; etc.). If you're generating 400 gigabytes per experiment, that means you're keeping a lot of (possibly excessive) intermediate files.
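For a sanity check, the raw-data volume for the design described (8 samples × 30M pairs, PE 150) can be estimated like this; the ~0.4 bytes/base gzipped-FASTQ ratio is my assumption, and real files vary with quality-score entropy:

```python
# Back-of-envelope raw FASTQ size for 8 samples of 30M PE150 read pairs.
# BYTES_PER_BASE is an assumed gzip compression ratio (sequence + quality
# lines together); real-world values are commonly ~0.3-0.5.
BYTES_PER_BASE = 0.4

read_pairs = 30_000_000
read_len = 150
bases_per_sample = read_pairs * 2 * read_len        # 9 Gbp per sample
gb_per_sample = bases_per_sample * BYTES_PER_BASE / 1e9
gb_per_experiment = gb_per_sample * 8               # ~29 GB raw per experiment

print(f"~{gb_per_sample:.1f} GB/sample, ~{gb_per_experiment:.0f} GB raw per experiment")
```

So the raw FASTQ files alone are only a small fraction of 400 GB; the rest would be intermediate files.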


> My university offers 5TB of storage as a standard and then charges £250/TB/year.

This doesn't sound right, even if these are ultra-fast disks. I bought a 10 TB disk for about the same amount you will be paying for 1 TB for one year. Not a very fast disk, but still. Our high-performance storage system is $30/TB/year.


This is very similar to the cost of storage at my institution. We are charged £100/TB/copy/year, but they only really do things in 2 copies, so that comes to £200/TB/year. There is a special deal here where, if you buy 5 years at once, they store the data indefinitely.

The cost of highly parallel, scale-out cluster storage systems is more or less completely unrelated to the cost of directly attached disks, or even consumer-grade NAS, where each disk platter can generally only be accessed by one user at a time.

I once had the same thought as you. When I was applying for a grant to build a bioinformatics-specific HPC, I thought I'd bypass the university and get our own storage, and got quotes directly from the supplier. You paid £30k for each 75TB disk shelf (giving you 37.5TB of usable storage when set up with mirroring). You then needed at least one management node and a cache shelf, and finally a service contract. At the time, each of those 75TB shelves was built from 2.5TB disks - that's 30 disks per shelf. With (as we were planning) 6 shelves, that's 180 disks, not counting the cache. With that many disks, failure of at least one disk is quite common. Under the service contract, the replacement disk would arrive between a week and a day before the disk failed; when the failure happened, you could just pull the old one out and pop the new one in without taking the system offline, and it was automatically repopulated from the mirror.

It turned out that the capital cost was almost exactly £1,000/TB, which makes £200 per mirrored TB per year a pretty good deal (assuming the system lasts 5 years).


This is copied directly from the website:

> The cost of additional storage allocation above the standard 5TB per project allocation is £250 per TB per year. This cost is on top of that of compute time.


If you are doing 50 experiments × 8 samples = 400 samples, I don't think 48 cores sounds that unreasonable, but yes, you could always ask for it for half the time.

For disk, you could probably fit into 10TB if you needed to, but why bother trying and risk having to delete a file you might need later? I usually budget 5TB for a project, but my individual projects are (usually) smaller than 400 samples. I don't think 20TB is particularly excessive, although you could go for 15TB (300GB per experiment).

One way to look at this is you are not trying to decide how to make a particular amount of money go a long way, but rather how much money to ask for to do something properly.

If you are quoting in £s, the chances are you are submitting to a UKRI grant call. I've never had my computational costs queried on a UKRI grant. Even being quite generous, they will pale in comparison to the total costs of the grant. UKRI grants are generally in the £500k-£800k bracket. Even if you ask for 48 cores for the full length of the grant plus 10TB, that's still only around £25k. The sequencing you are talking about would cost around £72k alone.

6 months ago

When I cost a grant (my institution charges the same as yours does for CPU time), I just cost for 16-20 CPUs for the whole length of the grant (cores × 24 hrs × 365 days). It works out at just less than £1,500 a year for 20 high-memory cores. Mind you, we are not charged for access, which is free, but rather for "priority access": if we use more than we've costed, we can continue to use the cluster, just with longer queue times.

Modern clusters can run remarkably quickly: we recently mapped 40 extra-deep RNA-seq samples in 30 minutes (granted, we did assign the job up to 640 cores). That puts an upper limit of 320 CPU hours on the job (I suspect it was less, because not all 40 jobs were running for the whole 30 minutes).

This is much better than owning your own server - I really don't recommend that to anyone who has the option. When costing up the relative prices, remember to cost in your own time adminning the server. Even if it only takes 2 hours a week (which might be reasonable for an inexperienced person, averaged over the whole length of a three-year grant), that's still 312 hours. Your university probably charges a funder at least £60/hr for your time, so that's an extra cost of at least £18,000 to your grant. Add in around £6,000-7,000 for the server.

Even then you are getting a system with a single point of failure, probably connected to an ordinary power socket not designed for such a high draw, and with no uninterruptible power supply. On a cluster, if a node goes down, nobody notices: the other nodes take up the slack while the IT technicians repair the broken one. With your own server, if it goes down, nobody gets any work done until you've arranged (in your own time) for someone to come and fix it. On a cluster, someone else also deals with data backup, RAID failures, etc. You also get proper data-centre aircon, so your system doesn't slow down when the office gets too hot.
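The admin-time arithmetic above, spelled out (the 2 h/week and £60/hr charge-out rate are the rough figures assumed in that paragraph, not measured numbers):

```python
# Hidden cost of administering your own server, using the rough figures
# from the paragraph above (2 h/week admin, £60/hr charge-out, 3-year grant).
hours_per_week = 2
weeks_per_year = 52
grant_years = 3
admin_hours = hours_per_week * weeks_per_year * grant_years   # 312 hours
charge_out_rate_gbp = 60
admin_cost = admin_hours * charge_out_rate_gbp                # £18,720
hardware_cost = 6500                                          # midpoint of £6,000-7,000

print(f"admin £{admin_cost:,} vs hardware £{hardware_cost:,}")
```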


The way you are presenting it, using a cluster seems like a no-brainer. And maybe it is in the UK, because things may have improved since my experience 20 years ago with IT guys at Edinburgh.

At our cluster, one has to pay a lot more than a cent per CPU-hour to get 640 cores at the snap of a finger, because there are other guys willing to pay the same amount to get priority. My hypothetical server with 40 CPUs will do the job in (640/40) × 30 minutes = 8 hours, which is perfectly acceptable to me. It may make some people warm and fuzzy that a massive mapping job got done in 30 minutes, but for practical purposes - at least my practical purposes - it makes no difference if the same job gets done in a day instead.

To have a computer available all the time without competition with others is more important to me than having a huge cluster for a couple of burst jobs per year. To each their own.


I think it is very important to also take the situation of OP into account and adjust perspective:

• OP is writing a grant application to a funding agency, which may have its own view on buying hardware vs. using institutional resources. Some grants might not even allow purchasing hardware, only depreciation.
• the institution already offers HPC for a competitive price
• most likely, the experimental costs outweigh computational costs

Therefore, it is secondary whether the generally optimal solution is HPC or a local machine. The task is to produce a safe, somewhat realistic, professional-looking cost estimate that the accountant can enter as direct costs, in the same way as consumables and travel costs.

The main aim of the OP is to get the application accepted in the first place. Once the project is funded, it might be possible to re-allocate between different budget posts as required. It might therefore also be good to talk to an accountant at your faculty about a more pragmatic way to address the problem.


My main reasons for recommending the institutional HPC over a private server have nothing to do with the amount of compute available. I agree that having 40 cores available without competition is more important than having 640 but having to wait in a queue. This is why I pay for 40 cores of priority access at 1p per CPU hour, even though non-priority access is free here (well, not free: it comes out of the overheads, which you pay irrespective of whether you use the cluster).

I talked about the speed of the cluster mostly as an aid for calculating the costs the OP would need to put on their grant.

No, my main reason for suggesting that people use the HPC is the cost of administering your own server, which, even if you are moderately skilled, can be very expensive once the full economic/opportunity costs are taken into account. As I said, even at 2 hours a week, the cost of your time massively dwarfs the cost of the hardware. Indeed, the cost of HPC would have to be nearly 6x as much before it became cheaper to admin your own server.

I'm lucky that IT here is fairly good. I've been at places where the institutional HPC teams were less helpful. The group I was in there ran its own compute resources, but that entailed hiring our own sysadmin at a cost of £80k a year (£35k salary + £20k on-costs + £15k overheads).

6 months ago
Michael 54k

From my experience, this is extremely difficult to get right, and almost impossible without access to the specific cluster and billing system. Your billing system looks relatively straightforward, but take a close look at the fine print: do they charge extra for GPU/storage/memory usage? Also, they bill based on wall-clock time, which is unfair but common. For example, if you utilize your allocated CPUs inefficiently and the jobs mostly sit idling or in IO wait, you will still be billed for the full wall-clock time, CPUs, and maximum memory (running AlphaFold for 10 hours costs the same as running sleep). The run time and memory costs also depend on the size of the datasets. You cannot really extrapolate the run time from another stand-alone PC or a cloud machine, because the components may have widely different performance.

The best approach is to go to that cluster and run realistic test cases for each analysis. Prepare, for example, a representative RNA-seq and a ChIP-seq dataset with the same reference genome and depth as expected. Install or locate all the tools required and set up your analysis pipeline. Then submit the jobs to the cluster through the batch system (e.g. with sbatch). After the jobs have finished, make sure the results are correct and complete (e.g. not killed due to the wall-clock limit or memory).

Assuming the batch system is SLURM, there is a tool called sacct. Call it with the IDs of the test jobs, using this format:

user@uan02:~> sacct -j 123456 --format JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime
JobID           JobName    Elapsed      NCPUS   TotalCPU    CPUTime
------------ ---------- ---------- ---------- ---------- ----------
123456      af_monomer   01:45:38        896   06:41:03 65-17:27:28
123456.bat+      batch   01:45:38        112   06:41:03 8-05:10:56

Use the last column (CPUTime) and multiply it by the estimated number of samples you are going to process in total. Now you've got the estimated CPU hours assuming everything goes perfectly (call it CPU_pw, for "perfect world"); of course, it never does (running out of time, process killed, file not found; billed anyway). So you need to multiply CPU_pw by a safety factor (SF). I'd estimate you will run each pipeline at least twice, so say SF = 2-10, depending on how much experimentation and optimization you need to get your pipeline running or adapted to different input data.

Sum(CPUTime) over all pipeline types × number of samples × SF × £0.01 + eventual storage cost = estimated cost (round up to the next £1,000 or so)
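A minimal sketch of that calculation; the parser handles the `[D-]HH:MM:SS` elapsed-time format that sacct prints in the CPUTime column, while the per-sample CPUTime, safety factor, and sample count below are hypothetical placeholders, not measurements:

```python
# Sketch of the cost estimate above. cpu_time_hours() parses the
# "[D-]HH:MM:SS" format printed by sacct's CPUTime column.
def cpu_time_hours(s: str) -> float:
    days = 0
    if "-" in s:
        d, s = s.split("-")
        days = int(d)
    h, m, sec = (int(x) for x in s.split(":"))
    return days * 24 + h + m / 60 + sec / 3600

RATE_GBP = 0.01          # £ per core-hour, as quoted in the question
SAFETY_FACTOR = 3        # reruns, failed jobs, re-tuned parameters (a guess)
N_SAMPLES = 400          # e.g. 50 experiments x 8 samples

# Hypothetical: the test job's CPUTime, divided by the samples it processed.
per_sample_cpu_h = cpu_time_hours("65-17:27:28") / 40
estimate = per_sample_cpu_h * N_SAMPLES * SAFETY_FACTOR * RATE_GBP
print(f"estimated compute cost ~£{estimate:,.0f} before storage")
```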

I am guessing you will land at a few thousand pounds at most, still cheaper than buying a machine with suitable specs.

Hope this helps.

6 months ago

For each nf-core pipeline, a full run with a regular test dataset is performed when a new pipeline version is released. You will find the results on the respective pipeline page in the "AWS results" tab. Locate the execution_report in the pipeline_info subfolder for a ballpark estimate.

For the RNA-seq pipeline, which I run quite often, I can give you some numbers off the top of my head: 300-400 CPU hours is the minimum. With RSEM instead of Salmon, it goes up to 700. Add another 200-400 when processing UMIs, and when filtering rRNA contamination with SortMeRNA, expect something like >2000 in total, because that tool alone is dead slow. So a lot depends on the chosen parameters.

Also bear in mind that you will likely have to rerun the pipeline multiple times because something didn't work out as expected, so add a generous overhead to your calculations.

6 months ago
Mensur Dlakic ★ 27k

I think it is impossible to estimate this accurately. The good news is that at a price of one cent per hour you can (and should) err on the side of overestimating it, and I would say by at least a factor of 2. The bad news is that your money will eventually run out and you may find yourself needing to repeat something, or do it several different ways.

You already got this suggestion, but I will repeat it: it may be a good idea to buy your own server. This assumes you have someone who can do basic computer setup and maintenance, which may be an extra expense. It is like the difference between renting and getting a mortgage on a house. Renting is easier and may be cheaper because the landlord deals with the logistics, but longer term, buying a house is the better option.


I disagree with the comparison to a mortgage.

If I buy a house, rather than renting, then it is more hassle, and more short term expense, but I end up owning the house in perpetuity.

If I buy a server, it's about the same up-front capital cost as renting, and a lot more hassle (expensive hassle, if we are calculating the full economic cost); and because servers are only warrantied for 3 years and supported for 5, at the end of that time I don't own a functional server.


It wasn't meant to be a literal comparison, because no computer will be in use for 20+ years like a house. The long term for computers is not the same as for houses.

As to this point:

> because servers are only warrantied for 3 years, and supported for 5 years, at the end of that time, I don't own a functional server.

I don't know where you are buying your servers or how you are treating them, but I have a couple from 2015 that are still working without any problems. And another couple of beasts chugging happily despite being produced in 2012 - I bought them used in 2020.


The university here strongly discourages the use of computer hardware beyond its warranty period, and won't sign off on a grant that depends on machines older than the "support" period.

6 months ago
dthorbur ★ 2.2k

There are a lot of variables to consider: complexity of experiments including number of samples, depth of coverage, size and complexity of reference genome/transcriptome, experience of users in streamlining data analysis, etc...

Also HPC specific details like read/write speeds, data delivery and upload, and disk use costs.

There are alternatives to using a university HPC, but if you are a member of that university, cloud computing services like GCP and AWS will likely be more expensive, though there are usually good offers to join. These services, however, can be complicated to get started on if you have little or no experience.

I suspect asking a colleague who works on these kinds of data at your institute might be the best way to get a realistic estimate.


An alternative is just to get your own server or desktop. No restrictions there ;)


> No restrictions there ;)

You'll find yourself in a world of hurt ...


That doesn't scale. It is fine for a few samples, but if the project involves large datasets, there is no option but to use a local HPC.


Considering I work in a "mostly" (>80%) computational lab that works on atlas-level consortium projects and all we use are our own servers, I'm going to go ahead and say that you're incorrect.

Of course, there are important considerations if you own your own server (see other responses), but to say it "doesn't scale" and "there is no option than using a local HPC" is completely false.


I think it's correct to say that it doesn't scale, at least not in the same way as an HPC solution. That is what I meant by a world of hurt, and why I assumed your comment of "no restrictions" was ironic. If you bought a server that can host 128 GB RAM and cannot be extended, then there is your limitation, and it's a hard one. Other arguments for and against have already been mentioned.


I think it's fair to say it doesn't scale (though scaling probably isn't important for this OP; other issues, such as those I note above, probably matter more). Can you align reads, build a custom transcriptome annotation, and quantify against that transcriptome on a personal server? Absolutely. Can you do that with 30 samples? Yes.

We just did that with 10,000 samples, at ~100 CPU hours per sample.

Try doing that on a personal server!

That's what people mean by "doesn't scale".

As I say, probably not relevant for this OP.


I see, that's a fair point. Apologies for misunderstanding "scaling".

I haven't tried processing the entire TCGA RNA-seq dataset on my personal server, but I am curious how it would turn out (a NovaSeq run will probably take ~10 CPU hours with kallisto/STAR/salmon gene-expression quantification, so I'd imagine it would be doable; of course, other pipelines and different kinds of analyses will be more computationally intensive). I've tried a few smaller atlas projects, though.


I don't have data at hand, only some gut feeling. I surmise that the cost of the experiments is going to overwhelm the cost of the HPC for regular RNAseq and ChIPseq.

Say it takes 1 hour to align 10M reads to the human genome with 8 cores at £0.01 per core-hour (alignment is probably the most expensive step). That makes £0.08 per 10M reads. If you have 100 libraries at 100M reads per library, that costs £80. You need to add QC, trimming, downstream stuff, and analyses that need repeating (you used the wrong reference, parameters, etc.). Even padding generously, call it £2,000. I'd say the cost of 100 library preps and 100M reads/library of sequencing is way more than that?
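That back-of-envelope calculation, spelled out (all inputs are this comment's assumptions - 8 cores for 1 wall-clock hour per 10M reads at £0.01/core-hour - not measured numbers):

```python
# Rough alignment-cost estimate from the comment above. All inputs are
# assumptions: 8 cores busy for 1 wall-clock hour per 10M reads, £0.01/core-hour.
RATE_GBP = 0.01
cores, hours_per_10M_reads = 8, 1
cost_per_10M = cores * hours_per_10M_reads * RATE_GBP   # £0.08 per 10M reads

libraries, reads_per_library_M = 100, 100
alignment_cost = cost_per_10M * (reads_per_library_M / 10) * libraries

# Pad generously for QC, trimming, downstream analysis, and reruns.
padded_budget = 2000
print(f"alignment ~£{alignment_cost:.0f}; padded budget £{padded_budget:,}")
```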

If that's correct (big if) one doesn't need to get too involved in getting more accurate estimates or alternative solutions to the local HPC.