Question

Genome assembly process

0

8.2 years ago

kulvait ▴ 270

Hello, I was quite confused because, from my perspective, there appeared to be two versions of the genome assemblies: one from the Genome Reference Consortium and one from Ensembl. I only recently figured out that Ensembl probably just copies the sequence from the Genome Reference Consortium release builds and does not change it. So, for example, downloading ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/ should give the same sequence as downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/ The real difference seems to be that Ensembl does some post-processing and keeps the data in sync with dbSNP and other sources of information, perhaps more visibly, because it needs the data for its own tools, which are more public.
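One way to test the claim that the two downloads hold the same sequence is to hash each FASTA record while ignoring its header, since the two sites name chromosomes differently. The following is a minimal sketch under stated assumptions: the function names are made up for illustration, the input is any iterable of text lines (for example an opened, decompressed FASTA file), and bases are upper-cased so that soft-masking is not counted as a difference. The two directories may not contain exactly the same set of records, so comparing chromosome by chromosome is probably the safer check.

```python
import hashlib
from typing import Dict, Iterable


def sequence_digests(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's first header token to the MD5 of its sequence.

    Bases are upper-cased before hashing, and line wrapping is ignored,
    so only the sequence content itself is compared.
    """
    digests: Dict[str, str] = {}
    name = None
    md5 = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                digests[name] = md5.hexdigest()
            parts = line[1:].split()
            name = parts[0] if parts else ""
            md5 = hashlib.md5()
        elif name is not None:
            md5.update(line.upper().encode("ascii"))
    if name is not None:
        digests[name] = md5.hexdigest()
    return digests


def same_sequences(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """True if both inputs hold the same sequences, whatever the record names."""
    return sorted(a.values()) == sorted(b.values())
```

For real files, something like `gzip.open(path, "rt")` can be passed straight to `sequence_digests`, because the function only needs to iterate over lines.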

However, I would like to know more about the gene build and gene annotation process. What steps does it include? Who are the people behind assembling the sequence and producing the annotations? What tools do they use? What public sources of funding support their work? Are they performing de novo assemblies of human and other genomes, or are they only curating sequencing results produced decades ago?

I feel that we all discuss different tools for working with these data, but I really want to know more about the reference data themselves and how they came about.

Thanks Vojtech.

DNA • 2.0k views

updated 8.2 years ago by lh3 33k • written 8.2 years ago by kulvait ▴ 270

1

Genome Reference Consortium consists of several institutions and EBI is part of the group: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/

There are some slide decks here that describe the process of how assemblies are put together: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/workshops/

8.2 years ago by GenoMax 144k

0

Thank you, very interesting reading.

8.2 years ago by kulvait ▴ 270

0

We at Ensembl work on annotating assemblies; we do not produce them. We need assemblies to be publicly available so that we can import the sequence and run our analyses to annotate genes, variation data, orthologues and paralogues, and regions involved in gene regulation. We don't change the genomic sequence, but we add value to it by annotating it. A bunch of 'acgt's is meaningless without the annotation we provide. For human and mouse, we get the assemblies from the GRC. Gene annotation is done by a dedicated team, the Ensembl Genebuild team. Detailed information on the genebuild of the human assembly can be found here. Also have a look at our papers for additional information, in particular Ensembl 2016.

8.2 years ago by Denise CS ★ 5.2k

0

Thank you for your answer. I think it is wise to use one version of the reference sequence. I suggest that the README file ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/README should state that this is the same sequence that is also known as GRCh38.p5, because not everyone knows the GCA_000001405.20 identifier or that the two are the same. At least for me it was difficult to work out, and until now I have been referring to it as GRCh38.ensembl84.

I appreciate the amount of work you do and I would love to see the details.

Sometimes I am a bit confused by the number of different annotations that are produced and by how to decide which one is actually "better". There should definitely be some effort to converge and connect the results from NCBI, EMBL-EBI and other places.

8.2 years ago by kulvait ▴ 270

0

I agree with your suggestion. At the moment we state that GRCh38.p5 = GCA_000001405.20 on our annotation page and in the output of this REST endpoint. I will check whether we can include this information in the README of the DNA FASTA files on the FTP site. If so, we will update it in forthcoming releases as new patches are incorporated into the primary assembly, e.g. GRCh38.p7 = GCA_000001405.22.
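To make the name/accession correspondence concrete, here is a small sketch that splits an assembly accession into its parts and keeps a lookup table of the two pairs stated in this comment. The class, function and table names are hypothetical; the only facts assumed beyond the thread are that the `GCA` prefix denotes a GenBank assembly, `GCF` a RefSeq one, and that the number after the dot is a version that changes with each update.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    prefix: str   # "GCA" (GenBank) or "GCF" (RefSeq)
    number: str   # digits shared by every version of the same assembly
    version: int  # changes with each update, e.g. each patch release


_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_accession(text: str) -> AssemblyAccession:
    """Split e.g. 'GCA_000001405.20' into prefix, number and version."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


# Only the two pairs stated in this thread; anything else would have to be
# looked up on the NCBI assembly pages.
NAME_TO_ACCESSION = {
    "GRCh38.p5": "GCA_000001405.20",
    "GRCh38.p7": "GCA_000001405.22",
}


def same_assembly_series(a: str, b: str) -> bool:
    """True if two accessions are versions of the same assembly record."""
    pa, pb = parse_accession(a), parse_accession(b)
    return (pa.prefix, pa.number) == (pb.prefix, pb.number)
```

With this, `same_assembly_series("GCA_000001405.20", "GCA_000001405.22")` shows that p5 and p7 are successive versions of one record, which is the relationship a README note would need to convey.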

8.2 years ago by Denise CS ★ 5.2k

0

In the README, we state the assembly in that file as GCA_000001405.20. A simple web search would take you to this page on the NCBI site, which shows the correspondence between GCA_000001405.20 and GRCh38.p5.

8.2 years ago by Denise CS ★ 5.2k

0

When I said the assemblies need to be publicly available, I should have made clear that this means 'submitted to GenBank, ENA or DDBJ' (aka INSDC). Some assemblies are available but have not been submitted to one of those databases.

8.2 years ago by Denise CS ★ 5.2k

score 0 · Answer 1 · 2016-05-18

GRC produces the reference genome. Ensembl copies it.
GRC is composed of multiple institutes, including EBI/Sanger, NCBI, WashU and more. I know several NCBI/WashU researchers who are doing most of the analyses. Ensembl is maintained by EBI.
GRC/NCBI clearly need to use the data as well, and they do keep gene annotation and dbSNP in sync with the reference genome. Ensembl arguably has a better interface.
For annotating human/mouse, we often use an evidence-based approach: map curated annotations (e.g. RefSeq and HAVANA), consensus annotations (e.g. CCDS) and uncurated data (e.g. UniProt, mRNA, EST and RNA-seq) to the reference genome and then seek a consensus. In the past, Ensembl used ab initio gene finders, but I am not sure how much they are used for human and mouse nowadays. You can find details here and here.
For human/mouse genomes, GRC only fixes issues with additional data. They may break a misassembled contig or add missing sequences. Typically they don't do whole genome assembly any more.
NCBI is part of the NIH, which receives funding from the US government and passes part of it on to other research labs. EBI receives funding from multiple European governments, I believe.
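To give a feel for what "seek a consensus" can mean in the evidence-based approach described above, here is a deliberately simplified toy. It is not how Ensembl or NCBI actually build genes: it only counts how many independent evidence sources support each intron and keeps those above a threshold. The function name, the `min_sources` parameter and the coordinates in the usage example are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (chromosome, start, end, strand); coordinates are whatever convention
# the caller uses, as long as it is the same for every source.
Intron = Tuple[str, int, int, str]


def consensus_introns(
    evidence: Dict[str, List[Intron]], min_sources: int = 2
) -> List[Intron]:
    """Return introns supported by at least `min_sources` evidence sources."""
    support: Dict[Intron, Set[str]] = defaultdict(set)
    for source, introns in evidence.items():
        for intron in introns:
            # A set means repeated hits from one source count only once.
            support[intron].add(source)
    return sorted(i for i, srcs in support.items() if len(srcs) >= min_sources)


# Hypothetical evidence; the source labels echo the kinds of data named above.
example = {
    "curated": [("1", 100, 200, "+"), ("1", 300, 400, "+")],
    "mRNA": [("1", 100, 200, "+")],
    "RNA-seq": [("1", 100, 200, "+"), ("1", 300, 400, "+"), ("1", 500, 600, "+")],
}
kept = consensus_introns(example)
```

A real pipeline would also weight sources by reliability, so that curated annotation outranks raw alignments, which a plain vote like this does not capture.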