Question

Displaying Information Regarding The Quality Of The Ensembl Genomes


11.5 years ago

Anima Mundi ★ 2.9k

Hello,

I would like to know how to systematically display some crucial information about the quality of the genomes in Ensembl. In particular, for every genome I would like to know, if possible, the depth, the coverage and the strain(s) used. The genome description pages (currently accessible from the Ensembl homepage) sometimes lack this information, even for widely studied species such as the rat. I found some of this information by browsing the site and the web in general, but I hope there is a systematic summary.

ensembl genome depth-of-coverage coverage • 2.0k views

updated 11.5 years ago by Andy Yates ▴ 120 • written 11.5 years ago by Anima Mundi ★ 2.9k


I would be interested in that too. I have looked for it in the past without any success.

11.5 years ago by Biojl ★ 1.7k

score 1 · Answer 1 · 2012-10-31

Hi,

It is important to differentiate between the production and the provision of genomes. Ensembl distributes INSDC genomes and then attempts to annotate them; we do not produce genomes. INSDC does store a lot of this information, so we try to provide links back to INSDC via assembly accessions, which are stored in the core database's meta table under the key assembly.accession. This is not filled in for all species, though coverage is good. Once you have the accession you can use NCBI or ENA for some of the information you need, e.g. the accession for rat's Rnor_5.0 is GCA_000001895.3, giving you two links:

http://www.ncbi.nlm.nih.gov/assembly/GCA_000001895.3

http://www.ebi.ac.uk/ena/data/view/GCA_000001895
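The lookup described above can be sketched in Python. This is only an illustration, not Ensembl's own tooling: the SQL string assumes the standard key/value layout of the core meta table (columns meta_key and meta_value), the URL patterns are copied from the two rat links above and may have changed since, and the names ASSEMBLY_ACCESSION_SQL, ACCESSION_RE and assembly_links are hypothetical.

```python
import re

# Query one could run against an Ensembl core database to fetch the accession.
# Assumes the meta table's key/value layout (meta_key, meta_value).
ASSEMBLY_ACCESSION_SQL = (
    "SELECT meta_value FROM meta WHERE meta_key = 'assembly.accession'"
)

# A versioned GenBank assembly accession such as GCA_000001895.3.
ACCESSION_RE = re.compile(r"^(GCA_\d{9})\.(\d+)$")


def assembly_links(accession):
    """Build the NCBI and ENA links for a versioned assembly accession.

    The URL patterns mirror the two rat links in the answer; they are
    assumptions about how those sites lay out their pages.
    """
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError("not a versioned GCA accession: %r" % accession)
    unversioned = match.group(1)
    return {
        # The NCBI link keeps the version suffix.
        "ncbi": "http://www.ncbi.nlm.nih.gov/assembly/" + accession,
        # The ENA link in the answer drops the version suffix.
        "ena": "http://www.ebi.ac.uk/ena/data/view/" + unversioned,
    }


if __name__ == "__main__":
    for site, url in assembly_links("GCA_000001895.3").items():
        print(site, url)
```

If the meta table has no assembly.accession row for a species, there is nothing to build the links from and you are back to searching by hand.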

From here you can follow the WGS project ID AABR06 (AABR00000000.6), where you can get some more assembly information in the COMMENT section:

##Genome-Assembly-Data-START##
Assembly Method       :: Newbler v. 2.0.0-PreRelease-01162009
                         paired with Phrap v. 0. 990329 for Sanger
                         reads; CLC bio for Solid reads
Assembly Name         :: Rnor_5.0
Genome Coverage       :: 3x BAC; 6x WGS ABI Sanger reads
Sequencing Technology :: Sanger; SOLiD
##Genome-Assembly-Data-END##
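If you need to collect these fields for many species, a block like the one above can be turned into key/value pairs. The following is a minimal sketch that assumes the layout shown here (a START and END marker, "key :: value" lines, indented continuation lines); parse_assembly_data and EXAMPLE_COMMENT are hypothetical names, and older records may not contain such a block at all.

```python
EXAMPLE_COMMENT = """
##Genome-Assembly-Data-START##
Assembly Method       :: Newbler v. 2.0.0-PreRelease-01162009
                         paired with Phrap v. 0. 990329 for Sanger
                         reads; CLC bio for Solid reads
Assembly Name         :: Rnor_5.0
Genome Coverage       :: 3x BAC; 6x WGS ABI Sanger reads
Sequencing Technology :: Sanger; SOLiD
##Genome-Assembly-Data-END##
"""


def parse_assembly_data(comment_text):
    """Return the 'key :: value' pairs found between the START/END markers."""
    fields = {}
    key = None
    inside = False
    for line in comment_text.splitlines():
        stripped = line.strip()
        if stripped == "##Genome-Assembly-Data-START##":
            inside = True
            continue
        if stripped == "##Genome-Assembly-Data-END##":
            break
        if not inside or not stripped:
            continue
        if "::" in line:
            # A new field: split once on the '::' separator.
            key, value = line.split("::", 1)
            key = key.strip()
            fields[key] = value.strip()
        elif key is not None:
            # A wrapped line continues the previous field's value.
            fields[key] += " " + stripped
    return fields


if __name__ == "__main__":
    for name, value in parse_assembly_data(EXAMPLE_COMMENT).items():
        print(name, "=>", value)
```

Text with no such block simply yields an empty dictionary, which is one way to flag the species for which you will need another source.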

There's also strain information at the end as a feature:

FEATURES             Location/Qualifiers
     source          1..112651
                     /organism="Rattus norvegicus"
                     /mol_type="genomic DNA"
                     /strain="BN/SsNHsdMCW"
                     /db_xref="taxon:10116"
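The strain can be pulled out of the source feature in the same spirit. This sketch only handles qualifiers that fit on one line in the /name="value" form shown above; real records can wrap long values across lines, so a full flat-file parser would be more robust. QUALIFIER_RE, parse_qualifiers and EXAMPLE_FEATURE are hypothetical names.

```python
import re

EXAMPLE_FEATURE = '''
FEATURES             Location/Qualifiers
     source          1..112651
                     /organism="Rattus norvegicus"
                     /mol_type="genomic DNA"
                     /strain="BN/SsNHsdMCW"
                     /db_xref="taxon:10116"
'''

# Matches a single-line qualifier such as /strain="BN/SsNHsdMCW".
QUALIFIER_RE = re.compile(r'^\s*/(\w+)="([^"]*)"\s*$')


def parse_qualifiers(feature_text):
    """Collect single-line /name="value" qualifiers into a dictionary."""
    qualifiers = {}
    for line in feature_text.splitlines():
        match = QUALIFIER_RE.match(line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers


if __name__ == "__main__":
    # .get() returns None when the record carries no strain qualifier.
    print(parse_qualifiers(EXAMPLE_FEATURE).get("strain"))
```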

If the information is not available for a species, then you will have to go back to the data producer (e.g. Baylor) or to the genome paper.

On a side note, for next-gen based genomes, metrics like depth don't really mean that much, and other metrics like N50 can be misleading. The Assemblathon (http://assemblathon.org/) is doing a good job of addressing this issue, and for next-gen genomes you may want to switch to using their recommended metrics.
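To make the N50 caveat concrete, here is a small sketch using the usual definition: the length of the contig at which the running total, taken from the longest contig down, first reaches half of the assembly size. The contig lengths are invented for illustration, and the effect shown (discarding short contigs raises N50 without improving any sequence) is one commonly cited weakness, not necessarily the one meant above.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    full = [50, 40, 30, 20, 10, 10, 10, 10, 10, 10]
    # The same assembly after dropping every contig shorter than 20.
    filtered = [length for length in full if length >= 20]
    print(n50(full), n50(filtered))
```

Here the filtered assembly reports a higher N50 (40 instead of 30) even though it contains strictly less sequence, so N50 values are only comparable between assemblies of similar total size.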