Is the Ensembl GRCh38 genome assembly more up to date than the UniProtKB online database?
4 weeks ago
giammafer • 0

Dear all, I am working with a list of Ensembl accession codes for a desired group of proteins.

I have downloaded the protein annotations for the GRCh38 genome assembly.

I fetched the genomic coordinates from the UniProtKB API service using the Ensembl accession codes. The service provides protein annotation records that include the coordinates I need.

However, I would like to obtain the same coordinates by parsing the GRCh38 data locally instead of querying an online database. I think I have found a way that uses the FASTA file of protein sequences and the GTF annotation file for the GRCh38 genome assembly. Starting from the Ensembl protein codes (in the FASTA sequences), it should be possible to find the Ensembl gene codes in the GTF annotations and then, in the same annotations, the desired genomic coordinates. However, the GTF annotation file was last updated on 19-Mar-2021, while the protein sequences in FASTA format date from 27-Mar-2021 (today is 19-Sep-2021).
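The local lookup described above could be sketched roughly as follows. This is only a minimal illustration, assuming the GTF has the usual nine tab-separated columns and that CDS rows carry `gene_id` and `protein_id` attributes; the function name `protein_coordinates` and the sample rows (with made-up identifiers) are hypothetical.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def protein_coordinates(gtf_lines):
    """Map each protein_id to its gene_id and the genomic span of its CDS rows."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only CDS rows are used here
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        attrs = dict(ATTR_RE.findall(fields[8]))
        protein_id = attrs.get("protein_id")
        if protein_id is None:
            continue
        record = spans.get(protein_id)
        if record is None:
            spans[protein_id] = {
                "gene_id": attrs.get("gene_id"),
                "chrom": chrom,
                "start": start,
                "end": end,
                "strand": strand,
            }
        else:
            # Widen the span so it covers every CDS segment of this protein.
            record["start"] = min(record["start"], start)
            record["end"] = max(record["end"], end)
    return spans


# Made-up rows for illustration only; the identifiers are not real.
SAMPLE_GTF = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_FAKE1";\n',
    '1\thavana\tCDS\t150\t300\t.\t+\t0\tgene_id "ENSG_FAKE1"; '
    'transcript_id "ENST_FAKE1"; protein_id "ENSP_FAKE1";\n',
    '1\thavana\tCDS\t500\t800\t.\t+\t1\tgene_id "ENSG_FAKE1"; '
    'transcript_id "ENST_FAKE1"; protein_id "ENSP_FAKE1";\n',
]

coords = protein_coordinates(SAMPLE_GTF)
print(coords["ENSP_FAKE1"])
```

For the real file, the same function could be fed the lines of `gzip.open("Homo_sapiens.GRCh38.104.chr.gtf.gz", "rt")`. Note that the span reported here is the outer extent of the coding sequence, not the per-exon coordinates.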

This discrepancy made me unsure which source holds the most up-to-date information.

Now I am wondering:

If I query UniProtKB through an API service, is it possible to find protein annotations that are not yet included in the GTF annotation set for a specific genome assembly (in this case GRCh38 of 27-Mar-2021)? In other words, could protein annotations fetched from UniProtKB be more recent than the 27-Mar-2021 GTF annotations for GRCh38?

Moreover:

Is it possible that UniProtKB stores protein codes that have a corresponding entry in the Ensembl database (the cross-link section of UniProt web pages) but are not yet included in the GRCh38 GTF annotations downloadable through the Ensembl FTP service? (I mean the GTF file Homo_sapiens.GRCh38.104.chr.gtf.gz, in this repository: http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/)

I am asking because, if I want the latest protein codes and annotations, I think I should take into account how many new codes and annotations are potentially submitted each month. If the online databases update their content more frequently than the genome assembly, I will go for the API querying strategy.

4 weeks ago

I'm not sure one can say that one source is more up to date than the other. Firstly, GRCh38 is a genome assembly version, not an annotation version. Genome assemblies (i.e. the underlying nucleotide sequence) are versioned separately from their annotations, which describe which regions of the nucleotide sequence correspond to various features, such as genes, or regions that will be transcribed to become cDNAs, which will then be translated into proteins.

New versions of the genome come along every 5 or so years. Almost everyone in the world now uses the GRCh38 genome assembly, and this has been the de facto standard for nearly 7 years now. A new version has recently become available, but it is not in wide usage, nor is it clear if it ever will be.

However, there are several different sources of annotation, none of which is the master or official version, and all differ slightly. Two common annotations are the Ensembl annotation and the Entrez annotation. They will differ on which genes are where, or even on what counts as a gene or how many of them there are. It's not that one is correct and one is wrong; they are just different. Ensembl releases a new version every three months. I don't know how often the Entrez annotation is updated.

As I understand it, the Entrez annotation is based on the RefSeq database of cDNA sequences, which itself is a curated subset of the GenBank database of nucleotide sequences. UniProt protein sequences come from automated translation of sequences in the EMBL-Bank/GenBank/DDBJ databases. These sequences also feed into the Ensembl genome annotation, but the Ensembl annotation also uses other data sources. So both UniProt and the genome annotations (e.g. Ensembl) depend independently on (at least partially) the same data.

To make things more complex, Ensembl/Entrez also do their own translations of their predicted genes, which may or may not be the same as the ones in UniProt.

                              -----------
 ----------------            | EMBL-Bank |                  -----------
| Ensembl/Entrez |<----------| GenBank/  |---------------->| UniProtKB |
 ----------------  Locate on |   DDBJ    | Translate into   -----------
        |          genome     -----------     protein
        | Translate
        v
 -----------------
| Ensembl protein |
 -----------------


The upshot of this is that there is not a 1-to-1 mapping between UniProt IDs and Ensembl IDs. Not every UniProt entry has one and only one corresponding Ensembl ID, and not every Ensembl ID has one and only one corresponding UniProt ID. However, I'm pretty sure that if UniProt is giving you a genome location, it must be getting it from one of the genome annotations, although not necessarily the Ensembl one (it's not clear from the UniProtKB website where this information comes from).
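If you have a table of ID pairs, a quick way to see where the mapping is not 1-to-1 is to group it in both directions. This is just a sketch; the function name `ambiguous_mappings` and the IDs in the example are made up for illustration.

```python
from collections import defaultdict


def ambiguous_mappings(pairs):
    """Return the IDs on each side that map to more than one partner."""
    by_uniprot = defaultdict(set)
    by_ensembl = defaultdict(set)
    for uniprot_id, ensembl_id in pairs:
        by_uniprot[uniprot_id].add(ensembl_id)
        by_ensembl[ensembl_id].add(uniprot_id)
    multi_uniprot = {u: sorted(e) for u, e in by_uniprot.items() if len(e) > 1}
    multi_ensembl = {e: sorted(u) for e, u in by_ensembl.items() if len(u) > 1}
    return multi_uniprot, multi_ensembl


# Hypothetical pairs, not real accessions.
PAIRS = [
    ("UP_A", "ENSP_1"),
    ("UP_A", "ENSP_2"),  # one UniProt entry, two Ensembl proteins
    ("UP_B", "ENSP_3"),
    ("UP_C", "ENSP_3"),  # two UniProt entries, one Ensembl protein
]

multi_uniprot, multi_ensembl = ambiguous_mappings(PAIRS)
print(multi_uniprot)
print(multi_ensembl)
```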

When you are looking at the Ensembl website, the GTF files will always contain the most up-to-date information that Ensembl has produced. If UniProt gets its genome location information from Ensembl, then I'm sure it will be identical.


Thank you for the explanation. I appreciated your figure very much. Sometimes it is very difficult to understand how the databases are connected to each other.

Have you got some suggestions about resources that could explain how to understand the relationship between different annotations?

Regarding my use of the GRCh38 assembly, I realized that it was not clear enough. I downloaded the FASTA and GTF files from this FTP service in Ensembl: https://www.ensembl.org/info/data/ftp/index.html.

On the Human row, I selected 'Protein sequence (FASTA)' and 'Gene sets' (in GTF format).

Because both file names included GRCh38, I assumed that it was the reference genome assembly for the annotations and protein sequences stored in the files. For this reason, I sometimes wrongly refer to the assembly when I mean the protein annotations or sequences.

Please let me know if I am wrong to think of the GRCh38 assembly in this way.


It is true that the annotation is inextricably tied to the assembly and makes no sense without it, so the assembly should be quoted. But in terms of reproducibility, the release number is more relevant. The annotation is constantly updated, whereas the assembly stays the same. For someone (including yourself) to be able to match up your work to the annotation you worked with, you should include the Ensembl release number.
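One simple way to keep track of this is to record the assembly and release straight from the GTF file name. The sketch below assumes the `Species.Assembly.Release.…` naming seen in the file mentioned in the question (`Homo_sapiens.GRCh38.104.chr.gtf.gz`); the helper name `parse_ensembl_gtf_name` is hypothetical, and other Ensembl files may not follow the same pattern.

```python
import re

# Species.Assembly.Release.<rest>, e.g. Homo_sapiens.GRCh38.104.chr.gtf.gz
GTF_NAME_RE = re.compile(
    r"^(?P<species>[A-Za-z_]+)\.(?P<assembly>[^.]+)\.(?P<release>\d+)\."
)


def parse_ensembl_gtf_name(filename):
    """Extract (species, assembly, release) from an Ensembl-style GTF file name."""
    match = GTF_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unrecognised file name: {filename}")
    return (
        match.group("species"),
        match.group("assembly"),
        int(match.group("release")),
    )


provenance = parse_ensembl_gtf_name("Homo_sapiens.GRCh38.104.chr.gtf.gz")
print(provenance)
```

Writing this tuple next to any result you derive from the file makes it possible to tell later which annotation release the coordinates came from.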

4 weeks ago

Ian's answer is great, but I'm going to add a specific point about sequence. UniProt is a protein database, so their main focus is the sequence of the protein. RefSeq is a general nucleotide sequence database, so their focus is the sequence of the nucleotide. Ensembl is a genome browser, so our focus is the sequence of the genome. This means that our sequences may be different. The sequences are obtained from biological molecules, and those molecules must, at some point, have been extracted from real individuals, and real individuals have real genetic variants in their genomes. Since these databases take their data from different sources, which come from different individuals, the sequences will reflect the genetic variants in their source. Ensembl protein sequences are always a translation of the reference genome (selenocysteines and other weird biology notwithstanding), but this is not the rule for RefSeq or UniProt, so there may be differences there. This is not to say one is more accurate than another, just that populations are variable.
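To check whether two records for the same protein actually differ, you could compare the sequences position by position. This is only a rough sketch that handles the simple case of equal-length sequences (substitutions only, no alignment); the helper names and the short example sequences are made up.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}; the identifier is the first word of the header."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def sequence_differences(seq_a, seq_b):
    """List (1-based position, residue_a, residue_b) mismatches, or None if lengths differ."""
    if len(seq_a) != len(seq_b):
        return None  # would need a proper alignment, which this sketch does not do
    return [
        (index + 1, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# Made-up sequences for illustration only.
FASTA_TEXT = """>ensembl_copy example
MKTAY
IAKQR
>uniprot_copy example
MKSAY
IAKQR
"""

records = parse_fasta(FASTA_TEXT)
diffs = sequence_differences(records["ensembl_copy"], records["uniprot_copy"])
print(diffs)
```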


Thank you very much Emily, your answer clarified another doubt that I had.


I hope I am right to continue this topic using a comment to your answer.

In light of my latest results, and considering what you have stated, I have a further question related to my original one.

I still have a list of UniProt codes and I am trying to fetch the genomic coordinates through the API for programmatic access to UniProtKB (https://www.ebi.ac.uk/proteins/api/doc/).
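A rough sketch of such a query is below. It is written from memory of the Proteins API documentation, so please treat the details as assumptions to verify against the docs linked above: the `/coordinates/{accession}` path and the JSON field names (`gnCoordinate`, `genomicLocation`, `chromosome`, `start`, `end`, `reverseStrand`, `ensemblGeneId`, `ensemblTranslationId`). The function names and the example record are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://www.ebi.ac.uk/proteins/api"


def fetch_coordinates(accession, timeout=30):
    """Request the coordinates record for one UniProt accession (endpoint path is an assumption)."""
    url = f"{API_BASE}/coordinates/{accession}"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarise_locations(record):
    """Pull out one summary per genomic mapping; field names are assumed, not confirmed."""
    summaries = []
    for entry in record.get("gnCoordinate", []):
        location = entry.get("genomicLocation", {})
        summaries.append(
            {
                "gene_id": entry.get("ensemblGeneId"),
                "protein_id": entry.get("ensemblTranslationId"),
                "chrom": location.get("chromosome"),
                "start": location.get("start"),
                "end": location.get("end"),
                "reverse_strand": location.get("reverseStrand"),
            }
        )
    return summaries


# Hand-written record with made-up values, shaped like the assumed response.
EXAMPLE_RECORD = {
    "accession": "FAKE01",
    "gnCoordinate": [
        {
            "ensemblGeneId": "ENSG_FAKE1",
            "ensemblTranslationId": "ENSP_FAKE1",
            "genomicLocation": {
                "chromosome": "1",
                "start": 150,
                "end": 800,
                "reverseStrand": False,
            },
        }
    ],
}

summary = summarise_locations(EXAMPLE_RECORD)
print(summary)
```

A record without any mapping simply yields an empty list here, which makes it easy to count how many of the UniProt codes in a list have no genomic coordinates.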

Moreover, I read that UniProt, in collaboration with Ensembl, has mapped the protein sequences onto the GRCh38 genome (doi: 10.1093/nar/gkw1099).

However, it is true that the number of protein codes in UniProt is updated more frequently than the GRCh38 assembly.

Considering only the genomic coordinate data for proteins: is it possible that a new submission in UniProt (in TrEMBL, for instance) has no correspondence with Ensembl, because the genomic coordinates of the new protein do not match any gene model annotated on GRCh38?

If not: can we say that all the protein genomic coordinates in UniProt have been mapped onto GRCh38 (in connection with Ensembl), so that a new submission in UniProt cannot lack this data?

Thanks


It's definitely possible that there are models in UniProt that don't have any assigned coordinates in Ensembl or any other genome annotation. This is not just because they have not been mapped yet; perhaps there simply wasn't a good match.


Thank you Ian