Question: Matching Gencode annotations to Assemblies
3.8 years ago, roberto.spreafico (United States) wrote:

Hello,

I am trying to match Gencode's annotations to assemblies.

It is my understanding that the sequence of reference chromosomes changes only when there is a major version update (e.g. GRCh37 -> GRCh38). For minor versions (such as GRCh38.p2), patches (deltas between the major version and the new minor version) may be added (as well as haplotypes etc).

Gencode releases the following annotation:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf.gz

that matches the following assembly:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/GRCh38.p2.genome.fa.gz

 

If one doesn't want the patches, one can use the primary assembly:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/GRCh38.primary_assembly.genome.fa.gz

which matches the following annotation:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.primary_assembly.annotation.gtf.gz

But then, what is this annotation for?
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.annotation.gtf.gz

According to the description, this annotation describes reference chromosomes only. So why isn't this suitable for the primary assembly?

Also, it is often suggested not to mix and match Ensembl or Gencode annotations with UCSC assemblies, but given that there are 1:1 matchings (such as hg38 = GRCh38) that should be doable, as long as one takes care that chromosome names follow the same convention. 
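To make the naming point concrete, here is a minimal sketch of converting primary-chromosome names between the two conventions. The function names are made up for illustration, and the sketch assumes the usual Ensembl-style names ("1", "X", "MT") and UCSC-style names ("chr1", "chrX", "chrM"); scaffolds, patches and alternate loci are deliberately not handled, since they would need an explicit lookup table.

```python
# Illustrative only: convert primary-chromosome names between the
# Ensembl style ("1", "X", "MT") and the UCSC style ("chr1", "chrX", "chrM").
PRIMARY = [str(i) for i in range(1, 23)] + ["X", "Y"]


def ensembl_to_ucsc(name):
    """Return the UCSC-style name, or None if name is not a primary chromosome."""
    if name == "MT":
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    # Scaffolds, patches and alternate loci need an explicit lookup table.
    return None


def ucsc_to_ensembl(name):
    """Return the Ensembl-style name, or None if name is not a primary chromosome."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr") and name[3:] in PRIMARY:
        return name[3:]
    return None
```

Returning None for anything outside the primary chromosomes is a design choice here: it forces the caller to decide what to do with unplaced scaffolds rather than silently passing a wrong name through.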

Similarly, if we remove patches and alternate loci from GRCh38.p2, at that point wouldn't we get back to the primary assembly GRCh38, except for differences in scaffolds? Then, if our annotations of choice only describe reference chromosomes, then those annotations, originally meant for GRCh38, would also work fine with GRCh38.p2. Isn't that the case?
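One way to test this kind of assumption is to check whether every sequence name used in a GTF also appears as a header in the FASTA it is paired with. The sketch below is a rough illustration, not an official GENCODE tool; the helper names are hypothetical, and it assumes the FASTA sequence name is the first whitespace-delimited token after ">" and that the GTF sequence name is the first tab-separated column.

```python
def fasta_names(lines):
    """Collect sequence names: the first whitespace-delimited token after '>'."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(lines):
    """Collect the first column of every non-comment, non-blank GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_seqnames(gtf_lines, fasta_lines):
    """Return GTF sequence names that have no matching FASTA header."""
    return gtf_seqnames(gtf_lines) - fasta_names(fasta_lines)
```

Both inputs are plain iterables of text lines, so the gzipped files can be passed in directly after opening them with `gzip.open(path, "rt")`. An empty result means every annotated sequence exists in the assembly; it does not by itself prove that the coordinates agree.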

Thank you for your help!

Roberto

 

gencode annotations assembly gtf • 1.6k views
3.8 years ago, Emily_Ensembl (EMBL-EBI) wrote:

The only things added between patch versions (i.e. .p1, .p2) are patches. All of the haplotypes were present in GRCh38.0 and none will be added. New haplotype-like features are called "novel patches" until they are incorporated into a new genome assembly, GRCh39, although that is likely to be a graph genome, so the whole concept of haplotypes will be out the window. So removing patches from GRCh38.p2 would take you back to GRCh38.0, but the haplotypes have to stay. Therefore, the primary assembly for both will be identical, but the primary assembly of GRCh38.p2 is not the same as the complete genome for GRCh38.
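As a rough sketch of what "removing patches" could look like in practice, the function below drops named records from a FASTA stream. It is only an illustration: `drop_sequences` is a hypothetical helper, and the set of patch sequence names is assumed to be supplied by the caller from whatever source they trust, since nothing here identifies patches automatically.

```python
def drop_sequences(fasta_lines, names_to_drop):
    """Yield FASTA lines, skipping every record whose name is in names_to_drop.

    The record name is taken to be the first whitespace-delimited token
    after '>' (an assumption about the header layout).
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name not in names_to_drop
        if keep:
            yield line
```

Because it works line by line as a generator, the whole genome never has to be held in memory.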


Could you expand on the "graph genome" part? I haven't seen that term before and couldn't find any other info, but it sounds like the sort of thing that would be good to know more about.

written 3.8 years ago by Adamc

Graph genomes are a new way of representing a genome together with all of the sequences it could take. The current model is a linear sequence, with haplotypes shown as sequences layered on top of the genome. This creates a problem of perception: when you see a genome with sequence on top, you see the primary sequence as the "reference" and the haplotypes as "other". It also means that many analysis tools work only with the primary assembly, so the haplotypes get skipped from analyses. In fact, the haplotypes are equally relevant and important: they are sequences that individuals may actually have. Indeed, there are haplotypes that represent certain populations and ethnic groups, so it is important to ensure that all haplotypes and ethnic groups are considered equal in the eyes of the genome.

The solution is a graph genome. Instead of being completely linear, a graph genome consists of a linear sequence that then splits off into different sequences where there are alternative haplotypes. This means that all of the haplotypes become part of the primary assembly, and any analyses will include all possible sequences. Obviously, this will require a massive redesign of various tools to work with these data.
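A toy example may help picture this. The sketch below is not how any real graph genome format or tool works; the segment names and sequences are invented, and the graph is assumed to have no cycles. It just shows a linear stretch that splits into two alternative haplotypes and then rejoins, with every path through the graph spelling one possible sequence.

```python
# Toy graph genome: each node holds a sequence segment,
# and edges list the nodes that may follow it.
segments = {
    "start": "ACGT",
    "hap_a": "TTT",
    "hap_b": "GGCA",
    "end": "CCAA",
}
edges = {
    "start": ["hap_a", "hap_b"],  # the sequence splits into two haplotypes
    "hap_a": ["end"],
    "hap_b": ["end"],             # both haplotypes rejoin the shared sequence
    "end": [],
}


def all_sequences(node, segments, edges):
    """Return every sequence spelled by a path from node to a node with no successors.

    Assumes the graph is acyclic; a cycle would recurse forever.
    """
    successors = edges[node]
    if not successors:
        return [segments[node]]
    result = []
    for nxt in successors:
        for tail in all_sequences(nxt, segments, edges):
            result.append(segments[node] + tail)
    return result


print(all_sequences("start", segments, edges))
# ['ACGTTTTCCAA', 'ACGTGGCACCAA']
```

Neither haplotype is privileged in this structure: both are simply paths, which is the point made above about treating all haplotypes equally.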

written 3.8 years ago by Emily_Ensembl

You can find an illustrated example here: https://github.com/adamnovak/schemas/blob/master/doc/GraphModeFAQ.md

written 3.8 years ago by Ying W