Question: Locations of plots of quantities of publicly available biological data
 
6
 
 

There's a cliché in talks and presentations these days: a plot demonstrating the rapid (typically exponential or super-exponential) growth of publicly available biological data of one kind or another (e.g., sequence data, yeast two-hybrid data). These are frequently juxtaposed with a plot of Moore's law. You know the type. You have probably even used or made such a plot if you're on this site.

It's not always obvious where to find these plots. Surprisingly (disappointingly, even), major clearing houses for biological data such as GenBank and Gene Expression Omnibus (GEO) don't provide plots of their growth in any obvious location, let alone their front pages (where it makes the most sense to display such positive trends). Let's compile a list of where to find these plots, including, but not limited to:

  • Publications (decent)
  • Open-access publications (good)
  • Sites that provide up-to-date plots (better)
  • Scripts or programs that generate plots on the fly (excellent)
 
 
 
2

Good to see you here!

written 19 months ago by Paulo Nuin
 
3

I think it would also be interesting to post code that can generate these plots. The data are often available, although often not in the best format, for those who'd like to try a roll-your-own approach.

written 19 months ago by Neilfws

9 answers

 
6
 
 

We started this the other day. See this thread:

http://biostar.stackexchange.com/questions/2966/exponentially-increasing-genomes-slide

Another one I like that hasn't come up yet is the growth of GeneTests, i.e., the number of diseases for which testing is available:

http://www.ncbi.nlm.nih.gov/projects/GeneTests/static/whatsnew/labdirgrowth.shtml

 
 
 
1

was about to write the same thing, you were 3 secs faster ;)

written 19 months ago by Michael Schubert
 
1

Thanks. I picked poor search terms when looking for an existing question. I don't know if we should close this question as a duplicate, as I'm interested in any type of (high-throughput) biological data.

written 19 months ago by Gotgenes
 

Then you may want to refine your question so that it is not a duplicate ;)

written 19 months ago by Michael Schubert
 
 
5
 
 

Data for the growth of the number of articles in MEDLINE can be found here:

http://www.nlm.nih.gov/bsd/licensee/baselinestats.html

There is some time lag in interpreting numbers from the MEDLINE baseline files. For example, good data on the growth of MEDLINE through 2008 can be found in the 2010 baseline statistics: http://www.nlm.nih.gov/bsd/licensee/2010_stats/2010_Totals.html

EDIT 1: Data for the growth of the number of GeneRIFs in Entrez Gene can be found here:

http://www.ncbi.nlm.nih.gov/projects/GeneRIF/stats/

EDIT 2: Data for the growth of the number of GWAS studies in the Human Genome Epidemiology database:

http://hugenavigator.net/HuGENavigator/startPageWatch.do

 
 
 
 
5
 
 

I already added sequence data growth in UniProt in the other question. As you are interested in various data categories, here is the exponential growth of the RCSB PDB from the 1970s to date. Kudos to the RCSB PDB team for providing the data and the graph in a convenient way.

 
 
 
 
4
 
 

Just a brief note on a way to generate "growth of database" data yourself, at least for the Entrez databases.

Most of the Bio* projects include an EUtils library. The BioRuby module has a useful method, esearch_count, which counts the number of results for a query. As an example, you could retrieve total publications in PubMed for years 2000-2010 like this:

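#!/usr/bin/ruby
# Count PubMed records per publication year via NCBI EUtils (BioRuby).
require "rubygems"
require "bio"

# Identify yourself to NCBI; replace with your own address.
Bio::NCBI.default_email = "me@me.com"
ncbi = Bio::NCBI::REST.new

2000.upto(2010) do |year|
  # [dp] restricts the query to the publication date field.
  count = ncbi.esearch_count("#{year}[dp]", {"db" => "pubmed"})
  puts "#{year}\t#{count}"
end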

Redirect the output to create a tab-delimited file of year and count. Here, we're searching the DP (date of publication) field in PubMed. You could substitute any Entrez database, search term(s) and years.
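Once you have a year/count table like that, a quick way to check the "exponential growth" claim is a log-linear fit. The following is only a minimal sketch, assuming strictly positive counts and one "year<TAB>count" pair per line; parse_counts and fit_exponential are hypothetical helper names (not part of BioRuby), and the numbers in the example are synthetic, not real PubMed counts.

```ruby
# Sketch: estimate exponential growth from "year<TAB>count" lines.
# parse_counts and fit_exponential are illustrative helpers, not BioRuby API.

# Parse tab-delimited text into [[year, count], ...], skipping blank lines.
def parse_counts(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    year, count = line.split("\t")
    [Integer(year), Integer(count)]
  end
end

# Ordinary least-squares fit of ln(count) = a + b * year.
# Returns the growth rate b (per year) and the implied doubling time.
# Assumes every count is > 0 (ln(0) is undefined) and at least two years.
def fit_exponential(rows)
  xs = rows.map { |year, _| year.to_f }
  ys = rows.map { |_, count| Math.log(count) }
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  sxy = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  rate = sxy / sxx
  { rate: rate, doubling_time: Math.log(2) / rate }
end

# Synthetic example data (counts double every year); not real database counts.
synthetic = "2000\t100\n2001\t200\n2002\t400\n"
fit = fit_exponential(parse_counts(synthetic))
printf("growth rate: %.4f per year\n", fit[:rate])
printf("doubling time: %.2f years\n", fit[:doubling_time])
```

To run it on real output, replace the synthetic string with something like File.read("counts.tsv"), where counts.tsv is whatever file you redirected the year/count output into. A doubling time that stays roughly constant across different year ranges is what you'd expect from genuinely exponential growth.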

 
 
 
 
3
 
 

The Silva website plots the growth of ribosomal RNA databases.

e.g. http://www.arb-silva.de/documentation/background/release-104/

 
 
 
 
3
 
 

SCOP lists the statistics of its release history in tabular form for the last 12 years.

SCOP Classification Statistics

I agree with Khader that PDB has done an excellent job of reporting statistics on its entries. They have a histogram menu that can easily generate statistics on current entries based on various criteria.

e.g. Source Organism (Gene Source) Histogram

 
 
 
 
3
 
 

There is a news article from October 2010 in Science that has a plot of the growth of human SNP data, particularly with regard to the 1000 Genomes Project.

 
 
 

Bump! Not an OA article.

written 18 months ago by Khader Shameer
 
 
3
 
 

A recent paper with an updated "Growth of GEO" plot:

Le et al. Cross-species queries of large gene expression databases. Bioinformatics (2010) vol. 26 (19) pp. 2416-23

 
 
 