HGNC cross-references in UniProt
4.8 years ago
cdsouthan ★ 1.9k

There are 19035 protein-coding rows in the HGNC download, but the UniProt column (19035) collapses to 18883 unique IDs, inferring 432 one-to-many Swiss-Prot > HGNC mappings.

However, when I query UniProt with database:(type:hgnc) AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]", I get 19960 from the 20,168, implying 905 for the same 1:many, but I can only find 152 duplicates in the column.

Can anyone who's been doing something similar help out here? (Note that it falls between two help desks.)

HGNC human proteins uniprot • 1.3k views

After some hours of head scratching, cross-checking and making Venn intersects (see Twitter), I think I have an explanation. So no one needs to dive into this if they have better things to do, but I will hold off on my conclusions for a time, just to see if anyone wants to come up with an independently corroborative explanation (which I actually think is important for the domain of protein annotation).


Thanks for all the comments. I managed the review in the end: "Last rolls of the yoyo: Assessing the human canonical protein count [version 1; referees: awaiting peer review]" https://f1000research.com/articles/6-448/v1 Feedback welcome.

4.8 years ago
me ▴ 740

In UniProt release 2017_02 there are 171 UniProtKB/Swiss-Prot entries with more than one HGNC link, while 52 HGNC links point to more than one UniProtKB/Swiss-Prot entry. The first query below returns the Swiss-Prot entries with more than one HGNC link.

For data on the HGNC side, HGNC unfortunately lacks a SPARQL endpoint, so there is no nice way to do this kind of analytics there.

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT 
    ?protein 
    (GROUP_CONCAT(SUBSTR(STR(?db),30);separator=',') AS ?hgncs)
WHERE
{
   ?protein a up:Protein .
   ?protein up:reviewed true .
   ?protein rdfs:seeAlso ?db .
   ?db up:database <http://purl.uniprot.org/database/HGNC>
} GROUP BY ?protein HAVING (COUNT(DISTINCT(?db)) >1)

The inverse query asks for HGNC links present in more than one UniProtKB/Swiss-Prot entry:

PREFIX up:<http://purl.uniprot.org/core/> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
SELECT 
    ?db 
    (GROUP_CONCAT(SUBSTR(STR(?protein),33);separator=',') AS ?proteins)
WHERE
{
  ?protein a up:Protein .
  ?protein up:reviewed true .
  ?protein rdfs:seeAlso ?db .
  ?db up:database <http://purl.uniprot.org/database/HGNC>
} GROUP BY ?db HAVING (COUNT(DISTINCT(?protein)) >1)
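
Since HGNC has no SPARQL endpoint, the same two counts could be reproduced offline from the HGNC download. The following is only a minimal Python sketch: the function name `one_to_many` and the toy identifiers are made up for illustration, and it assumes you have already extracted (HGNC ID, UniProt accession) pairs from whichever columns your download uses.

```python
from collections import defaultdict

def one_to_many(pairs):
    """Find one-to-many mappings in both directions.

    pairs: iterable of (hgnc_id, uniprot_accession) tuples.
    Returns two dicts: HGNC IDs linked to several accessions,
    and accessions linked to several HGNC IDs.
    """
    by_hgnc = defaultdict(set)
    by_protein = defaultdict(set)
    for hgnc_id, accession in pairs:
        by_hgnc[hgnc_id].add(accession)
        by_protein[accession].add(hgnc_id)
    many_proteins = {h: sorted(a) for h, a in by_hgnc.items() if len(a) > 1}
    many_hgnc = {p: sorted(h) for p, h in by_protein.items() if len(h) > 1}
    return many_proteins, many_hgnc

# Toy data with invented identifiers, not real HGNC or UniProt records.
pairs = [
    ("HGNC:1", "P00001"),
    ("HGNC:1", "P00002"),  # one HGNC ID, two proteins
    ("HGNC:2", "P00003"),
    ("HGNC:3", "P00003"),  # one protein, two HGNC IDs
    ("HGNC:4", "P00004"),
]
many_proteins, many_hgnc = one_to_many(pairs)
print(len(many_proteins), "HGNC IDs map to more than one protein")
print(len(many_hgnc), "proteins map to more than one HGNC ID")
```

Using sets per key means repeated rows for the same pair are not counted as duplicates, which mirrors the COUNT(DISTINCT ...) in the queries above.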

OK, thanks, but the biological/curation issue behind the numbers above is as follows:

It looks like Swiss-Prot has included a large number of proteins (on the order of 500-800) that HGNC does not classify as protein-coding. The largest categories, I think (by manual inspection of matches from segments of the Venn I put on Twitter), are endogenous retroviruses, long non-coding RNAs and odour receptor pseudogenes. This numerically dominates the relatively small number of one-to-many mappings (SP <> HGNC in both directions, as Jerv shows), which I think both resources agree are proteins.
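
For anyone wanting to check this kind of breakdown without manual inspection, here is a rough Python sketch. It assumes two inputs you would have to build yourself: the set of HGNC IDs cross-referenced from Swiss-Prot, and a lookup from HGNC ID to HGNC's locus type. The function name, the `coding_label` default and the toy values are all hypothetical, so substitute whatever label your HGNC download actually uses for protein-coding genes.

```python
from collections import Counter

def non_coding_breakdown(swissprot_hgnc_ids, locus_type_by_hgnc,
                         coding_label="protein-coding"):
    """Tally Swiss-Prot-linked HGNC IDs that HGNC does not call protein-coding.

    coding_label is an assumed placeholder; adjust it to match the download.
    """
    counts = Counter()
    for hgnc_id in swissprot_hgnc_ids:
        locus_type = locus_type_by_hgnc.get(hgnc_id, "not in HGNC table")
        if locus_type != coding_label:
            counts[locus_type] += 1
    return counts

# Toy data with invented identifiers and categories.
swissprot_ids = {"HGNC:1", "HGNC:2", "HGNC:3", "HGNC:4"}
locus_types = {
    "HGNC:1": "protein-coding",
    "HGNC:2": "pseudogene",
    "HGNC:3": "endogenous retrovirus",
}
for locus_type, n in non_coding_breakdown(swissprot_ids, locus_types).most_common():
    print(locus_type, n)
```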
