Forum:Sequence (annotation) databases in 2021
10 months ago
Dunois ★ 2.0k

But these threads are really old now. Things have probably changed quite significantly in the meantime.

So I would like to start a small discussion on the topic of sequence databases.

Here are some issues that I think the good folks here at Biostars could address.

  • How do the various sequence databases, such as UniProt, NCBI's various divisions (RefSeq, NR, what else is there?), and Ensembl(?), stack up in 2021?
  • What are some other useful databases one should be aware of?
  • Are there any popular databases that have gone sour? (But people still just use them anyway out of habit.)
  • Are there any new and upcoming sequence + sequence annotation resources in the near future?
  • What are some general gotchas, myths, and misunderstandings/misconceptions one should watch out for when it comes to sequence resources?
10 months ago
GenoMax 117k

Primary sequence databases (NCBI/ENA/DDBJ) have been around for decades and are always current. They are the primary repositories of sequences and sync submissions overnight. They carry annotations for some of their sequences; for others there may be none. GenBank is an archival database, so keep in mind that it may hold multiple versions of a sequence and, more importantly, that some entries may have errors in them. It is the responsibility of submitters to correct those errors. That is the reason it is always preferable to use entries from RefSeq/HomoloGene, since those sections are curated and accurate.
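
To make the point about versions and curated sections concrete, here is a minimal Python sketch that splits an `accession.version` identifier and checks whether its prefix is one of the common RefSeq prefixes. The function names (`split_accession`, `describe`) and the example identifiers are chosen only for illustration, and the prefix table is deliberately incomplete, so treat it as a starting point and not as an authoritative classifier.

```python
# A subset of RefSeq accession prefixes. RefSeq accessions have the form
# <two letters>_<digits>; archival GenBank/ENA/DDBJ accessions carry no underscore.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA model",
    "XR": "predicted non-coding RNA model",
    "XP": "predicted protein model",
}


def split_accession(identifier):
    """Split 'NM_000546.6' into ('NM_000546', 6); version is None if absent."""
    identifier = identifier.strip()
    base, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return base, int(version)
    return identifier, None


def describe(identifier):
    """Return the accession, its version, and a rough guess at where it comes from."""
    base, version = split_accession(identifier)
    prefix, underscore, _ = base.partition("_")
    if underscore and prefix in REFSEQ_PREFIXES:
        source = "RefSeq " + REFSEQ_PREFIXES[prefix]
    else:
        source = "not a recognised RefSeq prefix (possibly an archival record)"
    return {"accession": base, "version": version, "source": source}


for example in ("NM_000546.6", "XP_011521234.1", "U49845.1", "U49845"):
    print(example, "->", describe(example))
```

An identifier without a version suffix is resolved by the databases to whichever version is current, so recording the full `accession.version` string is the safer habit when reproducibility matters.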

The Genome Reference Consortium is the apex body that manages genome releases for important genomes (human, mouse, zebrafish, rat and chicken). They release primary genome builds that then get deposited into the appropriate sections of the primary databases. Organizations such as NCBI, Ensembl and UCSC offer annotation that they generate internally, but the underlying sequence is identical for a given genome build.

There are plenty of other derived/special focus databases. UniProt (LINK) is all things proteins, PDB (LINK) for protein structures.

You will find organism specific databases that originally provided sequence/annotations for those genomes. They were useful in early days of genome sequencing but as large scale sequencing took off they became subject to disappearing grant money. Some have turned to a partial subscription model (e.g. TAIR, BioCyc, KEGG) to support themselves. Parts of subscription databases may still be freely accessible but other parts (and bulk-downloads) require a subscription. If you are lucky enough to have access then you are all set, otherwise you end up having to find other (perhaps less desirable) alternatives. In general, you should be able to find the info you need in some free form elsewhere. It will require more work (and vetting) on your part.

10 months ago
Ben_Ensembl ★ 2.0k

Firstly, just to say, I think GenoMax's overview is fantastic. I wanted to add a section about the data that's available in Ensembl and what we've got planned for the future.

As GenoMax pointed out, Ensembl takes the reference genome assemblies for a number of species from the publicly available primary databases (NCBI/ENA/DDBJ) and adds annotation in four broad categories:

Gene Annotation:

Gene and transcript models are annotated onto the reference genome assemblies using an automated gene annotation pipeline. For selected species (i.e. human, mouse, zebrafish, rat), gene annotation may also include manual curation, i.e. reviewed determination of transcripts on a case-by-case basis by the Ensembl-Havana curators. More information:

We link our gene, transcript and peptide features to features in other databases such as UniProt and RefSeq to help in comparison across different databases. More information:

For human annotation, we are currently involved in the MANE collaboration with NCBI to annotate an agreed upon, conserved, highly expressed and biologically relevant transcript for each human gene:

Variation data:

Ensembl imports small and large-scale sequence variants from a number of primary sources (e.g. dbSNP and EVA) as well as additional supporting data relating to phenotype, allele frequency and citations. For each variant, we then calculate predicted molecular consequences according to Sequence Ontology (SO) as well as pathogenicity and conservation scores. More information:

We also have a tool for annotating the molecular consequences of your own variation datasets called the Variant Effect Predictor (VEP):

Comparative genomics:

We perform a number of comparative analyses between the genes and genome sequences of species present in Ensembl to predict gene trees and homology relationships as well as whole genome alignments. More information:

Regulatory data:

For human and mouse, we predict the position and activity of regulatory features in a variety of cell types through an analysis of datasets from the ENCODE, RoadMap and BluePrint epigenomics projects. More information:

All of this data is available through the web interface, but you can also access the data at a variety of scales through the BioMart tool, REST API and FTP download.
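
As a rough illustration of the REST route, the Python sketch below looks up a gene by symbol using only the standard library. It assumes the public server at `https://rest.ensembl.org` and its `lookup/symbol` endpoint; the helper names (`lookup_url`, `lookup_symbol`) and the BRCA2 example are made up for this post, and this is not official Ensembl client code.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the public Ensembl REST server.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(species, symbol):
    """Build the URL for a gene-symbol lookup, asking for a JSON response."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def lookup_symbol(species, symbol, timeout=30):
    """Fetch the lookup record for a gene symbol and return it as a dict.

    Needs network access; urlopen raises an HTTPError for unknown symbols.
    """
    with urlopen(lookup_url(species, symbol), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL does not touch the network.
print(lookup_url("homo_sapiens", "BRCA2"))

# Example call (commented out because it needs network access):
# record = lookup_symbol("homo_sapiens", "BRCA2")
# print(record.get("id"), record.get("seq_region_name"), record.get("start"), record.get("end"))
```

For one-off queries like this the REST API is convenient; for thousands of genes, BioMart or the FTP dumps are usually the better fit.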

The Ensembl resources mentioned above provide data primarily for vertebrate species, but we have a sister project called 'Ensembl Genomes', which provides genome annotation data and visualisation for non-vertebrate species, divided into plant, fungi, protist, bacteria and non-vertebrate metazoan categories.

We have also recently launched the Ensembl Rapid Release genome browser, which provides rapid access to gene annotation data for newly sequenced genomes, without relying on the traditional Ensembl release cycle.

We are also in the process of designing a brand new Ensembl website. The site is currently available to view but has limited functionality, which we hope to add to over time:


