was about to write the same thing, you were 3 secs faster ;)
Question: Locations of plots of quantities of publicly available biological data |
||
6
|
There's a cliché in talks and presentations these days: a plot demonstrating the rapid (typically exponential or super-exponential) growth of publicly available biological data of one kind or another (e.g., sequence data, yeast two-hybrid). These plots are frequently juxtaposed against a plot of Moore's law. You know the type. You have probably even used or made such a plot if you're at this site. It's not always obvious where to find these plots. Surprisingly (disappointingly, even), major clearinghouses for biological data such as GenBank and Gene Expression Omnibus (GEO) don't provide plots of their growth in any obvious location, let alone on their front pages (where it makes the most sense to display such positive trends). Let's compile a list of where to find these plots, including, but not limited to:
|
|
6
|
We started this the other day. See this thread: http://biostar.stackexchange.com/questions/2966/exponentially-increasing-genomes-slide Another one I like that hasn't come up yet is the growth of GeneTests, the diseases for which testing is available: http://www.ncbi.nlm.nih.gov/projects/GeneTests/static/whatsnew/labdirgrowth.shtml |
|
|
1
Thanks. I failed in picking my search terms to look for an existing question. I don't know if we should close this question as a duplicate, as I'm interested in any type of (high-throughput) biological data. | ||
5
|
Data for the growth of the number of articles in MEDLINE can be found here: http://www.nlm.nih.gov/bsd/licensee/baselinestats.html There is some time lag in interpreting numbers from the MEDLINE baseline files. For example, good data on the growth of MEDLINE through 2008 can be found in the 2010 baseline statistics: http://www.nlm.nih.gov/bsd/licensee/2010_stats/2010_Totals.html EDIT 1: Data for the growth of the number of GeneRIFs in Entrez Gene can be found here: http://www.ncbi.nlm.nih.gov/projects/GeneRIF/stats/ EDIT 2: Data for the growth of the number of GWAS studies in the Human Genome Epidemiology database: http://hugenavigator.net/HuGENavigator/startPageWatch.do |
|
|
| ||
5
|
I already added sequence data growth in UniProt in the other question. As you are interested in various data categories, here is the exponential growth of RCSB PDB from the 1970s to date. Kudos to the RCSB PDB team for providing the data and the graph in a convenient way. |
|
|
| ||
4
|
You might also want to take a look at this: Edit: there are some issues with the paper; see Lars' blog post. |
|
|
| ||
4
|
Just a brief note on a way to generate "growth of database" data yourself, at least for the Entrez databases. Most of the Bio* projects include an EUtils library. The BioRuby module has a useful method, esearch_count, which counts the number of results for a query. As an example, you could retrieve total publications in PubMed for years 2000-2010 like this:
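A minimal sketch of the same idea, written against the NCBI E-utilities `esearch` endpoint directly rather than through BioRuby, so it needs only the Ruby standard library. The helper names (`esearch_url`, `parse_count`, `fetch_count`, `yearly_counts`) are hypothetical and chosen for illustration; they are not BioRuby's API. It assumes the endpoint returns XML containing a `<Count>` element when called with `rettype=count`.

```ruby
require 'net/http'
require 'uri'

# NCBI E-utilities esearch endpoint.
ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Build the esearch URL that asks only for the hit count of a query.
def esearch_url(db, term)
  uri = URI(ESEARCH)
  uri.query = URI.encode_www_form(db: db, term: term, rettype: 'count')
  uri
end

# Pull the first <Count> value out of an esearch XML response.
def parse_count(xml)
  match = xml.match(%r{<Count>(\d+)</Count>})
  raise ArgumentError, 'no <Count> element in response' unless match
  match[1].to_i
end

# Number of records in +db+ matching +term+ (performs a network request).
def fetch_count(db, term)
  parse_count(Net::HTTP.get(esearch_url(db, term)))
end

# Return [year, count] pairs, searching the DP (date published) field.
# +counter+ is injectable so the loop can be exercised without the network;
# +pause+ spaces out requests to stay polite to the NCBI servers.
def yearly_counts(db, years, counter: method(:fetch_count), pause: 0.4)
  years.map do |year|
    count = counter.call(db, "#{year}[dp]")
    sleep pause if pause.positive?
    [year, count]
  end
end
```

To print year and count separated by a tab, something like `yearly_counts('pubmed', 2000..2010).each { |year, n| puts "#{year}\t#{n}" }` should do.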
Redirect the output to create a tab-delimited file with year + count. Here, we're searching the DP (date published) field in PubMed. You could substitute any Entrez database, search term(s) and years. |
|
|
| ||
3
|
The SILVA website plots the growth of ribosomal RNA databases, e.g., http://www.arb-silva.de/documentation/background/release-104/ |
|
|
| ||
3
|
SCOP lists the statistics of its release history in tabular form for the last 12 years. Scop Classification Statistics I agree with Khader that PDB has done an excellent job of reporting the statistics on its entries. They have something called the histogram menu, which can easily generate statistics on current entries based on various criteria, e.g., Source Organism (Gene Source) Histogram |
|
|
| ||
3
|
There is a news article from October 2010 in Science that has a plot of the growth of human SNP data, particularly with regard to the 1000 Genomes Project. |
|
3
|
A recent paper with an updated "Growth of GEO" plot:
|
|
|
| ||
Good to see you here!
I think it would also be interesting to post code that can generate these plots. The data are often available, although often not in the best format, for those who'd like to try a roll-your-own approach.