How Much Does It Cost To Align A Flowcell In The Cloud?
7
12
10.6 years ago

How much does it cost to align 8 lanes of a HiSeq 2000 flowcell paired-end run consisting of 6 billion 100 bp human genomic reads (600 Gbp output) using Amazon EC2 and associated storage?

Assume an aligner such as BWA with default parameters, aligned against a human reference. Please compute the cost and time from upload of FASTQ files to download of BAM files, inclusive.

"It depends" is not a useful answer - if there is a factor such as instance type or EBS vs. S3, please select a sensible option and provide a quote.
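To make "provide a quote" concrete, here is the shape of the estimate I am after, with placeholder numbers (the throughput, instance price, transfer sizes and egress rate below are all assumptions for illustration, not real quotes):

```python
# Back-of-envelope cost sketch for aligning one flowcell on EC2.
# Every figure here is an assumption to be replaced with a real quote.

total_gbp = 600.0                 # flowcell output from the question
gbp_per_instance_hour = 3.0       # assumed BWA throughput on one large instance
instance_price_per_hour = 1.68    # assumed on-demand rate (c3.8xlarge-era pricing)

align_hours = total_gbp / gbp_per_instance_hour          # 200 instance-hours
compute_cost = align_hours * instance_price_per_hour     # ~$336

# Transfer: inbound to EC2 is free; assume ~$0.09/GB egress
# and ~250 GB of BAM output (a rough guess).
egress_cost = 250 * 0.09                                 # ~$22.50

print(f"compute ~${compute_cost:.0f}, egress ~${egress_cost:.2f}")
```

Note that splitting lanes across many instances shrinks wall-clock time but leaves the instance-hour total (and hence cost) roughly unchanged.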

cloud • 8.9k views
0

Is it intentional that this question looks like a homework assignment?

4

Remember to show your work for partial credit

1

Is this supposed to be a sarcastic answer?

0

"Broad has reduced the cost of processing on the cloud from about $45 per genome when the cloud move started to $5 now, and has a target of $3 based on some work currently in progress, Mr. Mayo said." https://blogs.wsj.com/cio/2018/03/12/harvard-mits-broad-institute-powers-genomic-research-in-the-cloud/

0

Is that pricing available for everyone, or only to those who are at the level that Broad does business at?

0

Yes, I suspect there are some efficiencies of scale and also negotiated rates.

2

Nope, that's the cost for anyone who uses our pipeline on Google, for a single whole genome, going from unmapped reads to GVCF or VCF, including QC. Nothing to do with scale or preferential pricing, except that we benefitted from GCP engineers' help to optimize the pipeline. Check it out here.

0

That's amazing! Good work.

5
10.2 years ago

Konrad Karczewski of stormseq said in a recent tweet that "a full genome (30X coverage) is about $30 or so, exome (80X coverage) is around $2-3." I think this is from fastq->vcf using BWA and GATK.

Assuming an 80X exome is about 40M reads, a $2.50 exome comes out to about $6.25/hmmvr (hundred million mapped variant reads - my awesome new metric), or about $375 for a HiSeq 2000 flowcell. Granted, this is going all the way to VCF, but that number is in line with what I've heard elsewhere.
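As a back-of-envelope check of the $/hmmvr arithmetic above (the 40M reads per 80X exome is the answer's own assumption):

```python
exome_cost = 2.50            # quoted cost of an 80X exome, $
exome_reads = 40e6           # assumed reads per 80X exome (from the answer)

# $ per hundred million mapped variant reads ("hmmvr")
cost_per_hmmvr = exome_cost / (exome_reads / 100e6)      # -> $6.25

flowcell_reads = 6e9         # HiSeq 2000 flowcell from the question
flowcell_cost = cost_per_hmmvr * flowcell_reads / 100e6  # -> $375

print(cost_per_hmmvr, flowcell_cost)
```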

I think the BWA->GATK pipeline is commoditized enough that providers should be quoting these $/hmmvr rates on scienceexchange. Then labs will have something to compare with, if they don't want to try AWS themselves.

0

Intriguing stuff - I am going to try to verify this in practice.

0

http://schatzlab.cshl.edu/presentations/2013-03-18.NYGC.AWS.CloudScaleSolutions.pdf says 3.3B reads for $97.69, equivalent to $2.96/hmmvr.

0

I heard the haplotype-based GATK is so compute-expensive that the price is more like $16/hmmvr.

3
10.6 years ago
Lee Katz ★ 3.1k

I believe that Florian Fricke calculated exactly this for the Amazon cloud. He gave a presentation showing the comparison, and I believe it also appears in one of his publications.

5

Results

We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics, where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.

Conclusions

Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.

0

Ah, 16S, but still a useful metric. Thanks!
3
6.6 years ago

Talking single samples from fastq->vcf, retail pricing on an AWS c3.8xlarge, and running BWA/GATK-based pipelines (not including upload/download, just runtime, sorry):

• ~$10 for a 30x 120 Gbp whole genome
• ~$1.68 for a 15 Gbp exome (typically finishes in under an hour), e.g. SRR098401 takes ~45 mins from fastq to vcf on one c3.8xlarge
• ~$50 for a 600 Gbp flowcell, but that answer does depend on the samples

This is done using tools that produce the same results as BWA/GATK (with HaplotypeCaller) but more efficiently and deterministically, from www.Sentieon.com. The tools are easy to install too; email me at brendan.gallagher@sentieon.com to use these tools.

1
7.2 years ago

Would it not be nice to describe the type of EC2 instance used in these calculations?

1
3.9 years ago

You didn't actually ask about GATK or variant calling, but I see that in a lot of answers (and that is part of what you would probably want to do on the cloud). I didn't run BWA-MEM on the cloud, but here are some thoughts:

1) I think the best solution depends upon the total number of samples. Right now, I believe you can get a $300 credit to test Google Cloud. So, if you only plan to process a limited number of samples, that can essentially be free.

2) For some people, I think precisionFDA may be an acceptable solution (I believe they already have apps for BWA-MEM alignment and GATK variant calling). If you are willing to contribute data (I created an account with a G-mail address), analysis through DNAnexus will be free.

I have some notes about my experience with precisionFDA in this blog post.

3) If you don't use precisionFDA (and instead use AWS or Google Cloud, post-credit), my limited testing gives me the impression that the costs may be a little higher than you may expect. For example, I spent a few hundred dollars getting used to AWS (although that may still have been less than a formal class), and I have some notes about running DeepVariant on Google Cloud (and AWS) here.

However, the long-term costs are something other users may admittedly be better able to answer than I am.

I hope this helps!

1

Speaking on behalf of the GATK team, we're now making all of our pipelines available on Google cloud through a platform developed at Broad called Terra. Terra is freely accessible, with compute & storage billed directly by Google. The pipelines are fully set up in workspaces that include test data and cost+runtime estimates. As I mentioned in a thread above, we worked with Google engineers to optimize for cost, so at this point it's probably one of the cheapest options (unless you have a spare HPC that you don't pay to use of course).

For example, the pipeline we use in production for pre-processing (from unmapped reads) and single-sample variant calling with HaplotypeCaller (to GVCF or VCF) costs ~$5 to run per 30x WGS sample. You can check it out here: https://app.terra.bio/#workspaces/help-gatk/five-dollar-genome-analysis-pipeline

Considering you get a $300 Google credit when you sign up for Terra (which might be cumulative with the basic intro credit, not sure), you can indeed get a lot of work done without paying a cent. You can also bring your own tools & pipelines, btw; it's not restricted to GATK or Broad tools. There's a Terra showcase that presents a variety of preloaded analyses from various groups, if you want to check that out: https://app.terra.bio/#library/showcase

FYI this blog post describes how running pipelines on Terra works if you want to get a sense of that first: https://software.broadinstitute.org/gatk/blog?id=24139

And this post shows how to run individual GATK commands on cloud in jupyter notebooks: https://software.broadinstitute.org/gatk/blog?id=24175

0

Thank you for letting me know about Terra. My conclusion was that local analysis is probably preferable for what I need to do, although this is good information for this discussion (and, if I get new data, I do plan on adding it to precisionFDA).

My previous tests were for DeepVariant, but I found the computational requirements to be considerably less for GATK. So, while the run time would be longer on a computer with 4 cores and 8 GB of RAM, I could successfully run GATK (in addition to having filtering options that I preferred over DeepVariant).

I mention this because I thought perhaps I should copy over the relevant portion of the DeepVariant discussion group thread:

Also, I think I understand better what you were saying before:

The current quote is "On a 64-core CPU-only machine, DeepVariant completes a 50x WGS in 5 hours".

I have ~25X WGS, so divide that by 2 (2.5 hours).

However, I was using 8 cores instead of 64 cores, so multiply by 8 (20 hours).

This matches the AWS run-time (~18 hours), but with an on-demand instance (without additional cost savings).

My mistake is that I overlooked "On a 64-core CPU-only machine." I apologize that it took me a little while to realize this, but I think this discussion may benefit other users (who probably don't have that on their local machine, or may want to decrease costs and not use that many cores on the cloud).

On the cost side, the tutorial says "preemptible VMs, which are up to 80% cheaper than regular VMs...not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, do not use the --preemptible flag." So, if my cost without the preemptible VMs is ~$10, an 80% reduction does indeed match the $2-3 estimated minimum preemptible cost.


That being said, if the computational requirements are less, I think costing $5 versus $10 sounds reasonable (and I expect there are ways to run GATK that would cost even less, if you ran it as you would on a regular computer).

DeepVariant users need to be aware that those numbers were for a 64-core CPU-only machine and that $2-3 was the estimated minimum preemptible cost (which, at 80% off the regular cost, matches my experience of $10-15).
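For reference, the scaling arithmetic above in a few lines (the 5 h / 64-core / 50x figures are from the DeepVariant quote; the $10 on-demand cost is my own rough number):

```python
# Runtime scales down with coverage and up with fewer cores (rough model).
base_hours, base_cov, base_cores = 5.0, 50, 64   # DeepVariant quote: 50x WGS, 64 cores

hours = base_hours * (25 / base_cov) * (base_cores / 8)   # 25x data on 8 cores
# -> 20 h, in the same ballpark as the ~18 h observed on AWS

on_demand_cost = 10.0                            # approximate on-demand cost, $
preemptible_cost = on_demand_cost * (1 - 0.80)   # "up to 80% cheaper"
# -> $2, consistent with the $2-3 minimum preemptible estimate

print(hours, preemptible_cost)
```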

1

That all makes sense to me. Right, the cost optimizations we use include preemptible instances, which make a huge difference. There are definitely some caveats around using those.

We have a version of the pipeline that's actually down to ~$3 and is faster, in part due to a new GATK4-native implementation of MarkDuplicates that uses Spark for multithreading. I don't think it's been formally released yet, but it shouldn't be long.

In general we don't expect everyone to switch to Google Cloud for full-time work; it will make sense for some people but not for others. And to be frank, it doesn't really matter to us. We're mostly focused on making sure people can access and test the pipelines with minimal effort, and having the preloaded workspaces available in Terra means we can point you to actual working pipelines and say: this is the latest Best Practices, here's how we run them, here are the resource files and the exact command-line arguments; and you can go and test them on your data without having to install anything. We just hope it removes some of the friction people face when starting out or trying to update to a new version. Next step is to actually start publishing benchmarks!

0

Ok - maybe I am reading too much into what you are describing as a "benchmark," but I think explaining the limitations of certain ways of presenting data (and emphasizing why certain methods may work better in certain situations) is important. For RNA-Seq, I think there needs to be some testing of methods for every project (and one person's benchmark may not represent the best option for your own data), and I think there is still some room to critically assess variant calls (particularly if they inform some sort of major decision). However, if it is at all helpful, I have those Exome-vs-WGS precisionFDA comparisons in this blog post.
0
7.2 years ago

Scalable and cost-effective NGS genotyping in the cloud
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4608296/

Souilmi reports: a high-coverage (~150×) exome from alignment to variant calling took 136 min for a total cost of ~$23 (download and backup to AWS S3 storage cost $5). A single genome (42× coverage) analysis took 13 h 52 min for a total cost of ~$109.

This is using the GATK Haplotype caller

0
1. When you say exome, can you please clarify the details of the exome? Is it a 50 Mb exome?
2. My calculation assumption for a 50 Mb exome at 150X coverage is 50 Mb × 150 × 2 (50% on target due to coverage of introns) = 15 Gb output. Would the compute cost of $23 be applicable here?
3. Are costs approximately linear with output size in Gb? Say, if I perform a small capture 1/4th the size of an exome, is the AWS cost $23/4 = ~$5.75?

0

I am not the author above, but I would venture to say:

1. Yes.
2. I had not really considered the on-target question - you're probably right there.
3. The claimed costs are almost never linear (or fair), because people throw out low numbers assuming you are not doing single-sample calling but instead huge batches where various EC2 instances can be efficiently managed. On phone conferences I have now heard people claim $5 or $3 150X exomes (fastq->vcf); I would love to see that in print.
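The arithmetic in questions 2 and 3 can be sketched as follows (the 50% on-target fraction is the questioner's own assumption, and as noted above, real costs are generally not linear):

```python
# Question 2: estimated sequencing output for a captured exome.
target_mb, coverage = 50, 150
on_target_fraction = 0.5   # assumed: half the reads land off-target (introns)
output_gb = target_mb * coverage / on_target_fraction / 1000   # -> 15 Gb

# Question 3: naive linear scaling of cost with output size.
exome_cost = 23.0
quarter_capture_cost = exome_cost / 4    # -> $5.75 if costs were linear

print(output_gb, quarter_capture_cost)
```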

0
5.4 years ago