I'm trying to figure out the current annual volume of sequencing data produced, in either bytes or basepairs, as there doesn't seem to be any up-to-date information about it.
Even now, in 2023, most references point back to the 2015 paper titled “Big Data: Astronomical or Genomical?”, which projected that by 2025 the volume of sequenced genomic data would fall in the range of 2-40 exabytes.
However, relying on an 8-year-old projection in 2023 is not entirely fair.
The 40-exabyte projection assumed that sequencing data volume would keep doubling every 7 months (approx. 3.3x annually), a trend that had held true up to 2015.
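As a quick sanity check on that conversion, here is a minimal sketch that turns a doubling time into an annual growth factor. The function name `annual_factor` is just an illustrative choice, not something from the paper.

```python
# Convert a doubling time (in months) into an annual growth factor.
# Doubling every d months means 12/d doublings per year.
def annual_factor(doubling_months: float) -> float:
    return 2 ** (12 / doubling_months)

factor = annual_factor(7)
print(f"{factor:.2f}x per year")  # roughly 3.3x, matching the figure above
```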
Additionally, the paper cited the size of the Sequence Read Archive (SRA) as an example. It stated that the SRA held 3.6 petabytes of data in 2015, with the top 20 institutes storing 100 petabytes in total.
Interestingly, an AWS blog noted that the SRA had grown to 36 petabytes by the end of 2020. That represents a 10-fold increase over roughly six years, which equates to an annual growth rate of about 47%, far from the 3.3x annual increase assumed in the paper.
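The growth-rate arithmetic can be reproduced with a short sketch. It uses only the two SRA sizes quoted above; the six-year span is my rounding (five years is shown as well for comparison), and the helper name `cagr` is illustrative.

```python
# Compound annual growth rate implied by two sizes a number of years apart.
def cagr(start: float, end: float, years: float) -> float:
    return (end / start) ** (1 / years) - 1

sra_2015_pb = 3.6   # SRA size cited in the 2015 paper (petabytes)
sra_2020_pb = 36.0  # SRA size per the AWS blog, end of 2020 (petabytes)

rate_6y = cagr(sra_2015_pb, sra_2020_pb, 6)
rate_5y = cagr(sra_2015_pb, sra_2020_pb, 5)
print(f"6-year span: {rate_6y:.0%} per year")
print(f"5-year span: {rate_5y:.0%} per year")

# For contrast: where the SRA would have ended up after six years
# if the 7-month doubling trend had continued.
paper_factor = 2 ** (12 / 7)
projected_pb = sra_2015_pb * paper_factor ** 6
print(f"At the paper's rate: about {projected_pb:,.0f} PB")
```

Either way you count the years, the observed rate is well under one doubling per year, while the paper's trend would have put the SRA in the exabyte range by 2020.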
With this context, I'm curious: does anyone have insights or more recent data on the current volume of genomic sequencing data?
Do the figures of 2 to 40 exabytes sound reasonable?
Thank you in advance for any input or guidance!
Illumina's blog also cites this paper, which seems unsubstantiated:
https://www.illumina.com/company/news-center/blog/solving-for-the-information-gap-in-genomics-breakthroughs.html
Illumina estimates 4M humans have been sequenced so far. I suspect this means WGS, since Regeneron alone has 1M exomes (idk, there might be 10M exomes total?) and UKBB + All of Us = 1.5M WGS, which seems consistent with that figure: https://www.illumina.com/company/news-center/feature-articles/25-greatest-impacts-in-25-years--a-look-back-at-illumina-and-the.html
So, to clarify: I think at least 4M whole genomes have been sequenced. I have no idea how many exomes, but probably some multiple of that.
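To put that count in basepairs, here is a rough back-of-envelope sketch. The 30x coverage is my assumption (a common depth for human WGS, not a figure from any of the sources above), and the genome length is approximate. I deliberately stop at basepairs, since bytes per base depends heavily on file format and compression.

```python
GENOME_SIZE_BP = 3.1e9  # approximate length of the human genome
COVERAGE = 30           # ASSUMPTION: typical WGS depth, not from the sources
N_GENOMES = 4_000_000   # Illumina's estimate of humans sequenced so far

# Total sequenced bases = genomes x genome length x depth of coverage.
def total_bases(n_genomes: int,
                genome_bp: float = GENOME_SIZE_BP,
                coverage: float = COVERAGE) -> float:
    return n_genomes * genome_bp * coverage

total_bp = total_bases(N_GENOMES)
print(f"{total_bp:.2e} bp")
```

This covers human WGS only. Exomes, non-human, and other assay types would come on top of it.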