Question

Matching Gencode annotations to Assemblies

1

8.9 years ago

roberto.spreafico ▴ 20

Hello,

I am trying to match Gencode's annotations to assemblies.

It is my understanding that the sequence of reference chromosomes changes only when there is a major version update (e.g. GRCh37 -> GRCh38). For minor versions (such as GRCh38.p2), patches (deltas between the major version and the new minor version) may be added (as well as haplotypes etc).

Gencode releases the following annotation: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf.gz

that matches the following assembly: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/GRCh38.p2.genome.fa.gz

If one doesn't want the patches, one can use the primary assembly: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/GRCh38.primary_assembly.genome.fa.gz

which matches the following annotation: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.primary_assembly.annotation.gtf.gz

But then, what is this annotation for? ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.annotation.gtf.gz

According to the description, this annotation describes reference chromosomes only. So why isn't this suitable for the primary assembly?

Also, it is often suggested not to mix and match Ensembl or Gencode annotations with UCSC assemblies, but given that there are 1:1 matchings (such as hg38 = GRCh38) that should be doable, as long as one takes care that chromosome names follow the same convention.
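To illustrate the naming-convention point, here is a minimal sketch of the renaming step, assuming the annotation uses Ensembl-style names ("1", "X", "MT") and the assembly uses UCSC-style names ("chr1", "chrX", "chrM"). The function name `ensembl_to_ucsc` is hypothetical, and it deliberately refuses scaffolds, patches and alternate loci, since their names do not follow a simple prefix rule.

```python
# Primary chromosome names in Ensembl style (no "chr" prefix).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def ensembl_to_ucsc(name: str) -> str:
    """Map an Ensembl-style primary chromosome name to UCSC style.

    Only the reference chromosomes are handled; anything else raises,
    because scaffold and patch names need a proper lookup table.
    """
    if name == "MT":
        # The mitochondrial genome is "MT" in Ensembl and "chrM" in UCSC.
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    raise ValueError(f"not a primary chromosome name: {name!r}")


print(ensembl_to_ucsc("1"))   # chr1
print(ensembl_to_ucsc("MT"))  # chrM
```

Raising on unknown names is a deliberate choice here: silently passing scaffold names through would hide exactly the kind of mismatch one is trying to avoid.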

Similarly, if we remove patches and alternate loci from GRCh38.p2, at that point wouldn't we get back to the primary assembly GRCh38, except for differences in scaffolds? Then, if our annotations of choice only describe reference chromosomes, then those annotations, originally meant for GRCh38, would also work fine with GRCh38.p2. Isn't that the case?
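One way to test whether a given annotation "works" with a given assembly is to check that every sequence name used in the GTF also appears as a FASTA header. The sketch below does that on toy data; the helper names (`fasta_names`, `gtf_seqnames`, `missing_from_fasta`) and the sequence names are made up for illustration. It only compares names, so it cannot tell whether the underlying sequences are identical.

```python
from typing import Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the first word after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_seqnames(lines: Iterable[str]) -> Set[str]:
    """Collect the first column of every non-comment, non-empty GTF line."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(fasta_lines: Iterable[str], gtf_lines: Iterable[str]) -> Set[str]:
    """Names annotated in the GTF that have no sequence in the FASTA."""
    return gtf_seqnames(gtf_lines) - fasta_names(fasta_lines)


# Toy data (not real GENCODE content).
toy_fasta = [">chr1 toy description", "ACGT", ">chr2", "GGCC", ">scaffold_A", "TTAA"]
toy_gtf = [
    "##description: toy annotation",
    'chr1\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "g1";',
    'chr2\tHAVANA\tgene\t1\t4\t.\t-\t.\tgene_id "g2";',
    'patch_B\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "g3";',
]

print(missing_from_fasta(toy_fasta, toy_gtf))  # {'patch_B'}
```

For the real files, the same functions can be fed line iterators from `gzip.open(path, "rt")`. An empty result means every annotated sequence has a counterpart in the assembly; sequences present in the FASTA but absent from the GTF (such as unannotated scaffolds) are not reported by this check.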

Thank you for your help!

Roberto

GENCODE Assembly Annotations GTF • 3.4k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by roberto.spreafico ▴ 20

Ram · Accepted Answer · 2015-05-28

3

8.9 years ago

Emily 23k

The only things added between patch versions (i.e., .p1, .p2) are patches. All of the haplotypes were present in GRCh38.0 and none will be added. New haplotype-like features are called "novel patches" until they are incorporated into a new genome assembly, GRCh39, although that is likely to be a graph genome, so the whole concept of haplotypes will be out the window. So removing patches from GRCh38.p2 would take you back to GRCh38.0, but the haplotypes have to stay. Therefore, the primary assembly for both will be identical, but the primary assembly of GRCh38.p2 is not the same as the complete genome for GRCh38, which also includes the haplotypes.

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by Emily 23k

0

Could you expand on the "graph genome" part? I haven't seen that term before and couldn't find any other info, but it sounds like the sort of thing that would be good to know more about.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Adamc ▴ 680

2

Graph genomes are a new way of representing genomes and all the possible sequences that they could contain. Currently the model is a linear sequence, and haplotypes are shown as sequences on top of the genome. This creates a problem of perception: when you see a genome with sequence on top, you see the primary sequence as the "reference" and the haplotypes as "other". It also means that many analysis tools can work only with the primary assembly, and the haplotypes get skipped from analyses. In fact, the haplotypes are just as relevant and important: they are possible sequences that individuals may have. Indeed, there are haplotypes that represent certain populations and ethnic groups, so it is important to ensure that all haplotypes and ethnic groups are considered equal in the eyes of the genome.

The solution is a graph genome. Instead of being completely linear, a graph genome consists of a linear sequence that splits off into different sequences wherever there are alternative haplotypes. This means that all of the haplotypes become part of the primary assembly, and any analysis will include all possible sequences. Obviously, this will require a massive redesign of various tools to work with these data.
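As a toy illustration of the idea (not any real graph genome format), the sketch below stores sequence segments as nodes and the allowed transitions as edges, then enumerates every sequence the graph can spell out. The node names, segments and the function `all_sequences` are invented for this example, and the graph is assumed to have no cycles.

```python
from typing import Dict, List

# Each node holds a stretch of sequence; "ref" and "alt" are two
# alternative haplotypes for the same region.
segments: Dict[str, str] = {"start": "ACGT", "ref": "A", "alt": "GG", "end": "TTC"}

# Directed edges: which node(s) may follow each node.
edges: Dict[str, List[str]] = {
    "start": ["ref", "alt"],
    "ref": ["end"],
    "alt": ["end"],
    "end": [],
}


def all_sequences(node: str, segments: Dict[str, str], edges: Dict[str, List[str]]) -> List[str]:
    """Return every sequence spelled by a path from `node` to a node with no successors."""
    successors = edges[node]
    if not successors:
        return [segments[node]]
    return [
        segments[node] + tail
        for nxt in successors
        for tail in all_sequences(nxt, segments, edges)
    ]


print(all_sequences("start", segments, edges))  # ['ACGTATTC', 'ACGTGGTTC']
```

Neither path is privileged here: the "ref" and "alt" branches are both ordinary routes through the graph, which is the change in perspective described above.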

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Emily 23k

1

You can find an illustrated example here: https://github.com/adamnovak/schemas/blob/master/doc/GraphModeFAQ.md

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Ying W ★ 4.2k