Question: How Much Does It Cost To Align A Flowcell In The Cloud?
10
5.3 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

How much does it cost to align 8 lanes of a HiSeq 2000 flowcell paired-end run, consisting of 6 billion 100bp human genomic reads (600Gbp output), using Amazon EC2 and associated storage?

Assume an aligner such as BWA with default parameters, aligning against a human reference. Please compute the cost and time from upload of FASTQ files to download of BAM files, inclusive.

"It depends" is not a useful answer - if there is a factor such as instance-type or EBS vs S3 please select a sensible option and provide a quote.

cloud • 3.4k views
ADD COMMENTlink modified 28 days ago • written 5.3 years ago by Jeremy Leipzig17k

Is it intentional that this question looks like a homework assignment?

ADD REPLYlink written 5.3 years ago by Leonor Palmeira3.6k
4

Remember to show your work for partial credit

ADD REPLYlink written 5.3 years ago by Jeremy Leipzig17k
1

Is this supposed to be a sarcastic answer?

ADD REPLYlink written 5.3 years ago by Leonor Palmeira3.6k
4
4.8 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

Konrad Karczewski of stormseq said in a recent tweet that "a full genome (30X coverage) is about $30 or so, exome (80X coverage) is around $2-3." I think this is from fastq->vcf using BWA and GATK.

Assuming an 80X exome is about 40M reads, a $2.50 exome comes out to about $6.25/hmmvr (hundred million mapped variant reads - my awesome new metric), or about $375 for a HiSeq 2k flowcell. Granted this is going all the way to VCF but that number is in line with what I've heard elsewhere.
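The arithmetic behind the $/hmmvr figure can be sketched as below. This is only an illustration of the numbers quoted in this post: the function names are made up for the example, and the 6 billion read flowcell size is taken from the question.

```python
# Illustrative sketch of the $/hmmvr arithmetic; function names are hypothetical.

HUNDRED_MILLION = 1e8

def dollars_per_hmmvr(cost_usd, reads):
    """Cost per hundred million reads, given a total cost and a read count."""
    return cost_usd / (reads / HUNDRED_MILLION)

def flowcell_cost(rate_per_hmmvr, flowcell_reads=6e9):
    """Scale a $/hmmvr rate to a flowcell (6 billion reads, per the question)."""
    return rate_per_hmmvr * flowcell_reads / HUNDRED_MILLION

# Assumption from the post: an 80X exome is about 40M reads and costs $2.50.
exome_rate = dollars_per_hmmvr(2.50, 40e6)
flowcell_estimate = flowcell_cost(exome_rate)

print(f"${exome_rate:.2f}/hmmvr, about ${flowcell_estimate:.0f} per flowcell")
```

The scaling is strictly linear in read count, which is the simplification the metric itself makes.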

I think the BWA->GATK pipeline is commoditized enough that providers should be quoting these $/hmmvr figures on scienceexchange. Then labs will have something to compare against if they don't want to try AWS themselves.

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Jeremy Leipzig17k

intriguing stuff - I am going to try to verify this in practice

ADD REPLYlink written 4.8 years ago by Istvan Albert ♦♦ 74k

http://schatzlab.cshl.edu/presentations/2013-03-18.NYGC.AWS.CloudScaleSolutions.pdf says 3.3B reads for $97.69, equivalent to $2.96/hmmvr

ADD REPLYlink written 4.7 years ago by Jeremy Leipzig17k

I heard the haplotype-based GATK is so computationally expensive that the price is more like $16/hmmvr

ADD REPLYlink written 2.2 years ago by Jeremy Leipzig17k
3
15 months ago by
United States
brendan.d.gallagher30 wrote:

These figures are for single samples from fastq->vcf, at retail pricing on an AWS c3.8xlarge, running BWA/GATK-based pipelines (runtime only, not including upload/download, sorry):

  • ~$10 for a 30x 120Gbp whole genome
  • ~$1.68 for a 15 Gbp exome (typically finishes in under an hour), e.g. SRR098401 takes ~45 mins from fastq to vcf on one c3.8xlarge
  • ~$50 for a 600 Gbp flowcell but that answer does depend on the samples
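To compare the three quotes above on a common scale, they can be normalized to dollars per Gbp. This is a rough sketch using only the figures in the list; the dictionary layout and labels are illustrative, not part of any tool.

```python
# Normalize the quoted prices to USD per Gbp (labels are illustrative).
quotes = {
    "30x whole genome": (10.00, 120),  # (approx. cost in USD, output in Gbp)
    "exome": (1.68, 15),
    "flowcell": (50.00, 600),
}

usd_per_gbp = {name: cost / gbp for name, (cost, gbp) in quotes.items()}

for name, rate in usd_per_gbp.items():
    print(f"{name}: ${rate:.3f}/Gbp")
```

On these numbers the flowcell quote is simply the whole-genome rate scaled up fivefold, while the exome comes out somewhat more expensive per Gbp.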

This is done using tools from www.Sentieon.com that produce the same results as BWA/GATK (with haplotype caller), but more efficiently and deterministically.

The tools are easy to install too; email me at brendan.gallagher@sentieon.com to try them.

ADD COMMENTlink written 15 months ago by brendan.d.gallagher30
2
5.3 years ago by
Atlanta, GA
Lee Katz2.8k wrote:

I believe that Florian Fricke calculated that exact thing, with relation to the Amazon cloud. He gave a presentation and showed the comparison, although I also think it is in one of his publications.

ADD COMMENTlink written 5.3 years ago by Lee Katz2.8k
5

Here it is: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026624

Results We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.

Conclusions Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.

ADD REPLYlink written 5.3 years ago by Lee Katz2.8k

ah 16S, but still a useful metric. thanks!

ADD REPLYlink written 5.3 years ago by Jeremy Leipzig17k
1
22 months ago by
Spain. Universidad de Córdoba
Antonio R. Franco3.4k wrote:

Would it not be nice to describe the type of EC2 instance used in these calculations?

ADD COMMENTlink written 22 months ago by Antonio R. Franco3.4k
0
22 months ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

Scalable and cost-effective NGS genotyping in the cloud
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4608296/

Souilmi reports:

For a high-coverage (~150×) exome, the runtime from alignment to variant calling was 136 min, for a total cost of $23 (download and backup to AWS S3 storage cost $5).

A single genome (42× coverage) analysis took 13 h 52 min, for a total cost of ~$109.

This is using the GATK HaplotypeCaller.
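As a rough way to compare the two reported runs, the totals can be divided by runtime to get an implied cost per hour. This is a back-of-the-envelope sketch: the function name is made up, and it uses the quoted totals as-is, since it is not clear from the summary whether the $23 includes the $5 of S3 transfer and backup.

```python
# Implied USD per hour of runtime from the quoted totals (hypothetical helper).

def usd_per_hour(total_usd, hours, minutes=0):
    """Total cost divided by runtime expressed in hours."""
    runtime_h = hours + minutes / 60
    return total_usd / runtime_h

# Quoted above: exome 136 min for $23; genome 13 h 52 min for ~$109.
exome_rate = usd_per_hour(23, 0, 136)
genome_rate = usd_per_hour(109, 13, 52)

print(f"exome: ${exome_rate:.2f}/h, genome: ${genome_rate:.2f}/h")
```

The two rates differ, so the reported totals probably cover more than instance time alone, or the two runs used different cluster configurations.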

ADD COMMENTlink modified 22 months ago • written 22 months ago by Jeremy Leipzig17k
  1. When you say exome, can you please clarify the details of the exome? Is it a 50 Mb exome?
  2. My calculation for a 50 Mb exome at 150X coverage is 50 Mb X 150 X 2 (assuming 50% on target, due to coverage on introns) = 15 Gb output. Would the compute cost of $23 be applicable here?
  3. Are costs approximately linear with output size in Gb? Say I perform a small capture 1/4th the size of an exome; is the AWS cost $23/4 = ~$5.75?
ADD REPLYlink written 17 months ago by New2R10
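The calculation in the questions above can be written out as a small sketch. The function names are illustrative, and the linear cost scaling is the question's own assumption rather than an established pricing model.

```python
# Sketch of the sequencing-output and cost-scaling arithmetic in the questions.

def required_output_gb(target_mb, coverage, on_target_fraction):
    """Raw output (Gb) needed to reach `coverage` over a capture target."""
    return target_mb * coverage / on_target_fraction / 1000

def linearly_scaled_cost(base_cost_usd, size_fraction):
    """Cost if compute scaled linearly with output size (an assumption)."""
    return base_cost_usd * size_fraction

# 50 Mb exome at 150X with 50% of reads on target.
exome_gb = required_output_gb(50, 150, 0.5)

# A capture one quarter the size of the exome, scaled from the $23 figure.
small_capture_cost = linearly_scaled_cost(23, 0.25)

print(f"{exome_gb:.0f} Gb output, ${small_capture_cost:.2f} for the small capture")
```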

I am not the author above, but I would venture to say

  1. yes

  2. I had not really considered the on-target question - you're probably right there

  3. the claimed costs are almost never linear (or fair), because people throw out low numbers assuming you are not doing single-sample calling but instead huge batches where various EC2 instances can be efficiently managed. On phone conferences I have heard people now claim $5 or $3 150X exomes (fastq->vcf); I would love to see that in print.

ADD REPLYlink modified 17 months ago • written 17 months ago by Jeremy Leipzig17k
0
28 days ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

https://www.prnewswire.com/news-releases/childrens-hospital-of-philadelphia-and-edico-genome-achieve-fastest-ever-analysis-of-1000-genomes-300540026.html

ADD COMMENTlink modified 28 days ago • written 28 days ago by Jeremy Leipzig17k