A few years back, I asked a dozen or so colleagues for classic/important papers that every bioinformatician should read as a part of their training. I thought BioStar might be a good place to resuscitate this exercise to get a broader set of candidates and let the community weigh in on what papers make up the bioinformatics "canon".
Here are some of the papers that I use for teaching to start the ball rolling:
Altschul et al. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. http://www.ncbi.nlm.nih.gov/pubmed/2231712
Myers et al. A whole-genome assembly of Drosophila. Science. 2000 Mar 24;287(5461):2196-204. http://www.ncbi.nlm.nih.gov/pubmed/10731133
Burge & Karlin. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997 Apr 25;268(1):78-94. http://www.ncbi.nlm.nih.gov/pubmed/9149143
Lowe & Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64. http://www.ncbi.nlm.nih.gov/pubmed/9023104
Depending on the level of interest in this topic, perhaps we can put together a library of "bioinformatics classics" on CiteULike.
Usually, these papers were classified by the fields of bioinformatics research in the 1990s, i.e., gene prediction (GENSCAN, Glimmer, etc.), alignment (BLAST, Smith-Waterman, Needleman-Wunsch, etc.), protein structure prediction (Chou-Fasman, etc.), and phylogenetics (PHYLIP, etc.).
Here's a short list of alignment-related articles, in addition to the already listed Smith-Waterman and Needleman-Wunsch papers:
There are also the famous articles by Margaret Dayhoff on substitution matrices:
Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found., Washington, DC.
Schwartz, R.M., Dayhoff, M.O. (1978) "Matrices for detecting distant relationships." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 353-358, Natl. Biomed. Res. Found., Washington, DC.
Oh, I did a blog post on one once. It was part of a "classic papers" blogging initiative that was really fun, actually.
In it I think I found the first computational protein analysis:
In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.
The program was called COMPROTEIN (yes, it was all caps). But it was in fact a pipeline of several programs: MAXLAP, MERGE, PEPT, SEARCH, QLIST, and LOGRED.
Reference: Dayhoff, M. O. and R. S. Ledley. Comprotein: A Computer Program to Aid Primary Protein Structure Determination. In Proceedings of the Fall Joint Computer Conference, 1962, 262-274. Santa Monica, CA: American Federation of Information Processing Societies, 1962. http://doi.acm.org/10.1145/1461518.1461546
The link is now broken, though; I'll have to find out where the paper lives now.
This link seems to work: http://portal.acm.org/citation.cfm?id=1461546
Nobody cited the Smith & Waterman algorithm?
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol. 1981. http://dx.doi.org/10.1016/0022-2836(81)90087-5
Needleman, Saul B.; and Wunsch, Christian D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325.
I wouldn't normally answer a question twice, but these are unrelated to my first answer.
Important papers to me personally:
Chothia C, Lesk AM. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J. 1986 Apr;5(4):823-6. http://www.ncbi.nlm.nih.gov/pubmed/3709526
Paving the way for homology modelling.
Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999 Dec;20(18):3551-67. http://www.ncbi.nlm.nih.gov/pubmed/10612281
The paper that outlined MASCOT, as important as BLAST for proteomics (though SEQUEST came earlier - Eng et al. (1994) J Am Soc Mass Spectrom 5: 976–989. doi:10.1016/1044-0305(94)80016-2).
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8. http://www.ncbi.nlm.nih.gov/pubmed/12368254
Other Bio* library papers are available, but I think most would agree, BioPerl is the most "important".
Maybe the 1000 Genomes paper published yesterday will open a new era in bioinformatics.
This morning I attended a talk by one of the authors, who explained some of the challenges the 1000 Genomes consortium has faced. For the first time in history, the biggest datasets in biology are reaching the scale of datasets in physics and astronomy. From now on, we will have to think more carefully about the tools we use. For example, physicists have developed an alternative to the Internet to share data, while we biologists still download data over HTTP or FTP, competing with people downloading MP3s. We need to look for alternatives for downloading the gigabytes of new data produced daily, such as shared cloud computing images. Moreover, the 1000 Genomes Project has also introduced many new formats, like BAM and SAM, and new tools to handle huge datasets.
I would also add the first COG paper to the list:
It offers interesting evolutionary insights, and personally speaking, the concept of COGs is quite a helpful tool.
PLoS Computational Biology has recently launched a series of Perspectives called 'The roots of bioinformatics', to illustrate the seminal papers in each of the sub-fields of bioinformatics.
To date, only two articles of the series have been published:
Searls DB. The roots of bioinformatics. PLoS Comput Biol. 2010 Jun.
Doolittle RF. The roots of bioinformatics in protein evolution. PLoS Comput Biol. 2010 Jul 29;6(7):e1000875. Review. PubMed PMID: 20686682; PubMed Central PMCID: PMC2912333.
If you are interested, you can create a citation alert for '"roots of bioinformatics" PLoS Computational Biology' in Entrez.
The review by David Searls in June 2010 in PLoS Computational Biology on the roots of bioinformatics will certainly point you to some classic papers, including some you likely never thought of as belonging to this field. This review was very well written and was a joy to read. The paper is here.
Interesting question! I'm just beginning to work in this field; I actually started a few months ago.
I've read some papers, but I quite enjoyed this one:
The Clustal paper(s): among the most cited papers in the world (across all scientific areas)
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research, 25 (24): 4876-4882, Dec 15 1997
Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research, 31 (13): 3497-3500, Jul 1 200..
I think this one is a classic:
Ruth Nussinov, George Pieczenik, Jerrold R. Griggs, and Daniel J. Kleitman: Algorithms for Loop Matchings. In: SIAM Journal on Applied Mathematics 35, No. 1, July 1978, pp. 68-82.
30 years ago, she came up with a beautiful dynamic programming algorithm for RNA secondary structure prediction.
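For anyone who hasn't seen it, here is a minimal sketch of the idea in Python: fill a table with the maximum number of base pairs for every subsequence, building up from short spans to long ones. This is a common textbook formulation rather than the exact presentation in the paper, and the function name, the `min_loop` parameter, and the choice to allow G-U wobble pairs are my own assumptions for illustration.

```python
# Base pairs allowed in this sketch (Watson-Crick plus G-U wobble; an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in an RNA sequence.

    min_loop is the minimum number of unpaired bases required between
    the two partners of a pair (hypothetical parameter for this sketch).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    # Work from short spans to long ones so sub-results are already filled in.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, splitting the problem in two.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

Calling `nussinov_max_pairs("GGGAAAUCC")` finds a three-pair hairpin (G-C, G-C, G-U around an AAA loop). Recovering the actual structure would need a traceback through the table, which I've left out to keep this short.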