Forum: What is the amount of sequencing data produced annually?
14 months ago
vincenthus ▴ 70

I'm trying to figure out the current annual volume of sequencing data produced, in either bytes or base pairs, as there doesn't seem to be any up-to-date information about it.

Even now in 2023, most references point back to the 2015 paper titled “Big Data: Astronomical or Genomical?”, which projected that by 2025 the volume of sequenced genomic data would fall in the range of 2-40 exabytes.

However, relying on an 8-year-old projection in 2023 is not entirely fair.

The 40-exabyte projection assumed that sequencing data volume would keep doubling every 7 months (approx. 3.3x annually), a historical trend that held up to 2015.

Additionally, the paper cited the size of the Sequence Read Archive (SRA) as an example. It stated that the SRA held 3.6 petabytes of data in 2015, with the top 20 institutes storing 100 petabytes in total.

Interestingly, an AWS blog noted that the SRA had grown to 36 petabytes by the end of 2020. That represents a 10-fold increase over roughly six years, which equates to an annual growth rate of about 47%, far from the 3.3x annual increase projected in the paper.
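
As a quick check of the two growth figures above, here is a minimal Python sketch. The function names are illustrative, and the six-year window is the rough span quoted in the post rather than an exact date range.

```python
def annual_factor_from_doubling(months_to_double: float) -> float:
    """Yearly growth multiple implied by a fixed doubling time in months."""
    return 2 ** (12 / months_to_double)


def annual_growth_rate(fold_increase: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.47 means 47%)."""
    return fold_increase ** (1 / years) - 1


# Projection in the 2015 paper: volume doubles every 7 months.
projected = annual_factor_from_doubling(7)

# Observed SRA growth: 3.6 PB (2015) to 36 PB (end of 2020), roughly 6 years.
observed = annual_growth_rate(36 / 3.6, 6)

print(f"projected: {projected:.2f}x per year")
print(f"observed:  {1 + observed:.2f}x per year ({observed:.0%} growth)")
```

The gap between roughly 3.3x and roughly 1.5x per year compounds quickly, which is why the 2015 projection and the SRA numbers diverge so much by 2023.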

With this context, I'm curious: does anyone have insights or more recent data on the current volume of genomic sequencing data?

Do the figures of 2 to 40 exabytes sound reasonable?

Thank you in advance for any input or guidance!

sequencing volume

sequencing vs other tech domains

data research sequencing • 3.9k views

Illumina estimates that 4M humans have been sequenced so far. I suspect this means WGS, since Regeneron alone has 1M exomes (I don't know, there might be 10M exomes in total?) and UKBB + AllofUs = 1.5M WGS, which seems consistent with that figure: https://www.illumina.com/company/news-center/feature-articles/25-greatest-impacts-in-25-years--a-look-back-at-illumina-and-the.html


So to clarify:

  • There are 10M exomes in total, sequenced by Illumina and others?

With:

  • Illumina accounting for 4M humans of that?
  • Of Illumina's 4M, 1.5M being WGS from UK Biobank + AllofUs?

I think there are at least 4M WGS sequences. I have no idea how many exome sequences but probably some multiple of that.

14 months ago
GenoMax 148k

NCBI provides a reasonably updated report of the data that is in SRA: https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/

“Do the figures of 2 to 40 exabytes sound reasonable?”

As for the number you are looking for: you could pick one, and it would likely not be correct. A significant amount of sequencing capacity is likely in the hands of for-profit industry, and we will probably never know how much data they produce per year.


Thank you for sharing.

So from the graph I can see that from 2017 to the end of 2022 (6 years) there has been a 7x increase, or about 7^(1/6) ≈ 1.38x per year, i.e. a 38% annual increase in the SRA.
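
To put that rate next to the "doubling every 7 months" assumption from the 2015 paper, a rough Python sketch can convert the 7x-in-6-years reading into a doubling time. The helper name is illustrative, and the 7x figure is an eyeballed value from the NCBI graph.

```python
import math


def doubling_time_months(fold_increase: float, years: float) -> float:
    """Months needed to double, given a total fold increase over some years."""
    annual_factor = fold_increase ** (1 / years)
    return 12 * math.log(2) / math.log(annual_factor)


sra_annual_factor = 7 ** (1 / 6)           # about 1.38x per year
sra_doubling = doubling_time_months(7, 6)  # SRA, 2017 to end of 2022

print(f"annual factor: {sra_annual_factor:.2f}x")
print(f"doubling time: {sra_doubling:.1f} months")
```

Under these numbers the SRA doubles roughly every two years rather than every 7 months.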


Theoretical sequencing capacity available worldwide (for all sequencing technologies) must be mind-boggling. Not all of that capacity is in use producing data at all times. We probably see only a fraction of it showing up in SRA. The trick is to estimate what fraction that would be. If you conservatively say 50% or less, that will allow you to come up with a number.
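
A back-of-the-envelope version of that idea, as a Python sketch. It uses the 36 PB SRA figure quoted in the question; the 25% and 10% fractions are purely hypothetical values added to show how sensitive the result is, and the function name is illustrative.

```python
SRA_PETABYTES = 36.0  # SRA size at the end of 2020, as quoted in the question


def implied_total_pb(public_pb: float, deposited_fraction: float) -> float:
    """Total data implied if only `deposited_fraction` of it reaches the SRA."""
    if not 0 < deposited_fraction <= 1:
        raise ValueError("deposited_fraction must be in (0, 1]")
    return public_pb / deposited_fraction


# 0.5 is the "conservative" value from the reply; 0.25 and 0.1 are assumptions.
for fraction in (0.5, 0.25, 0.1):
    total = implied_total_pb(SRA_PETABYTES, fraction)
    print(f"{fraction:.0%} deposited -> about {total:.0f} PB in total")
```

The estimate scales inversely with the assumed fraction, so the answer is only as good as that guess.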


Here's a neater version of the plot, generated while writing my dissertation:

14 months ago
Zhenyu Zhang ★ 1.2k

I think a good estimate would come from asking Illumina how many flow cells of each type they have sold, then multiplying by some discount factor.


About 65% of ILMN's revenue is reagents (i.e. sequencing consumables), or about $3B/yr, and a 30X WGS takes maybe $500 in reagents on average across all NovaSeqs. That would mean 6M WGS a year if everything were spent on human WGS (instead of RNA and other organisms). Sources: https://docs.google.com/spreadsheets/d/1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc/edit#gid=1569422585 and https://s24.q4cdn.com/526396163/files/doc_financials/2022/q2/2Q22-Summary-of-Prepared-Remarks.pdf
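
The arithmetic behind the 6M figure, as a small Python sketch. The `wgs_share` parameter is a hypothetical knob (not from the sources above) for the share of reagent spend that actually goes to human WGS.

```python
REAGENT_REVENUE_USD = 3e9       # ~65% of ILMN revenue, per the estimate above
REAGENT_COST_PER_WGS_USD = 500  # rough reagent cost of one 30X human genome


def wgs_per_year(reagent_revenue: float, cost_per_genome: float,
                 wgs_share: float = 1.0) -> float:
    """Genomes per year if `wgs_share` of reagent spend goes to human WGS."""
    return reagent_revenue * wgs_share / cost_per_genome


upper_bound = wgs_per_year(REAGENT_REVENUE_USD, REAGENT_COST_PER_WGS_USD)
print(f"upper bound: {upper_bound / 1e6:.1f}M WGS per year")

# Hypothetical: only half of the reagents are spent on human WGS.
half = wgs_per_year(REAGENT_REVENUE_USD, REAGENT_COST_PER_WGS_USD, wgs_share=0.5)
print(f"at 50% share: {half / 1e6:.1f}M WGS per year")
```

Treat the result as an upper bound, since a large part of the reagent spend goes to RNA and non-human samples.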


Thank you so much! Looking at the growth in sequencing consumables is probably the best approach.

Wow, that Google Sheets file is a gold mine.


Illumina is not the only player in the market. PromethION can generate a lot more data if one has the budget for consumables. Unless you are focused only on Illumina, there will be many additional options arriving in the next year.


I see that the bottom of that spreadsheet estimates that if every sequencer sold were running full time, they could generate 38M WGS a year.


Yes I also noticed the figure of 38M – such a great source.

If we were to assume all instruments were WGS-based, the operational ratio would be 6M/38M, which translates to 16%. It might be more realistic, however, to surmise that the maximum lifespan of these instruments is around 5 years, implying that not all of them remain in active use.

Furthermore, given that approximately 80% of Illumina's revenue is derived from consumables and the remaining 20% from the instruments themselves, this can be used as a metric to approximate the annual sequencing data output.

Breaking it down:

  • Illumina’s annual product revenue totals $4.1 billion.
  • 69% of the global installed output capacity is attributed to the ILMN NovaSeq S4.
  • The ILMN NovaSeq S4 is priced at a minimum of $4.84/Gb.

From the data above, we can deduce:

  • Annual sequencing output stands at approximately 847M Gb per year, calculated as $4.1 billion divided by $4.84/Gb.
  • This translates to roughly 9.4M human WGS annually, given that one human genome at 30x coverage is equivalent to about 90 Gb.
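
The two deductions above can be reproduced with a short Python sketch. It takes one 30x human genome as about 90 Gb (gigabases), which is the value that yields the 9.4M figure, and it follows the post's simplification of pricing all product revenue at the NovaSeq S4 rate.

```python
PRODUCT_REVENUE_USD = 4.1e9  # Illumina annual product revenue
PRICE_PER_GB_USD = 4.84      # NovaSeq S4 minimum price per gigabase
GB_PER_GENOME = 90           # assumed: 30x coverage of a ~3 Gb human genome

# Gigabases sequenced per year if all revenue bought output at this price.
gb_per_year = PRODUCT_REVENUE_USD / PRICE_PER_GB_USD

# Equivalent number of 30x human genomes.
genomes_per_year = gb_per_year / GB_PER_GENOME

print(f"{gb_per_year / 1e6:.0f}M Gb per year")
print(f"{genomes_per_year / 1e6:.1f}M WGS per year")
```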

I’ve made some rough assumptions, but the result is not too far off your number of 6M WGS (consumables spend / $500 per WGS).

But since Illumina mentioned that 4M genomes have been sequenced in total, plus an estimated 10M WES, the annual figure cannot be completely correct. I don't know how to think about that though.

What's intriguing to hypothesize about:

  • Assume that, over the next 5 years, the average price across the globally installed capacity drops to $1/Gb.
  • This rate aligns with the current competitive pricing seen from MGI and others, a benchmark Illumina would likely need to meet. Therefore, we could expect the average installed price to move towards $1/Gb.
  • Should Illumina keep its annual revenue at 2023 levels, this would mean sequencing data output rising by a factor of about 4.8x ($4.84/Gb divided by $1/Gb).
  • Consequently, this translates to an exponential growth rate of about 37% annually in sequencing output over the upcoming five years, in proportion to the decline in sequencing cost.
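
A minimal Python sketch of this projection, assuming flat revenue so that output scales inversely with price. The variable names are illustrative, and the $1/Gb target is the hypothesis stated above, not a forecast.

```python
CURRENT_PRICE_PER_GB = 4.84  # NovaSeq S4 minimum price today, USD
TARGET_PRICE_PER_GB = 1.00   # assumed average price in 5 years, USD
YEARS = 5

# With constant revenue, output rises by the same factor the price falls.
output_multiple = CURRENT_PRICE_PER_GB / TARGET_PRICE_PER_GB

# Constant yearly growth rate that compounds to that multiple.
annual_growth = output_multiple ** (1 / YEARS) - 1

print(f"output multiple over {YEARS} years: {output_multiple:.2f}x")
print(f"implied annual growth: {annual_growth:.0%}")
```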

I would imagine at least half of the reagents are used in transcriptomics (RNA), so the WGS count is more of a loose heuristic to gauge how many subjects are being sequenced
