Hello. I'm doing some analysis using a human protein interaction network (http://cbg.garvan.unsw.edu.au/pina/download/Homo%20sapiens-20121210.sif) available from PINA (http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do). When I query UniProt for the identifiers in the network, it turns out that a good number of them belong to now-deleted entries. See for example http://www.uniprot.org/uniprot/Q27223 and http://www.uniprot.org/uniprot/Q8NI70. The UniProt FAQ (http://www.uniprot.org/faq/11) explains that deleted entries are most likely caused by the associated nucleotide sequence data being retracted or coming to be recognized as non-coding or a pseudogene. With that in mind, how should I deal with those supposed proteins whose entries have been deleted from UniProt? Is it best to simply delete them from the network, no questions asked?
The entry Q8NI70 is deleted in UniProt, but on the NCBI server it is still a valid entry ( http://www.ncbi.nlm.nih.gov/protein/Q8NI70.1?report=girevhist ). Shouldn't the two databases be synchronized?
Thanks for pointing this out, it is indeed surprising. We have brought this to the attention of NCBI staff.
Thank you for the information. I wonder how a C. elegans protein made its way into a supposedly human PIN. I would still welcome suggestions on how to handle such deleted entries in the context of a PIN. Perhaps it could depend on the reason for deletion and on whether an equivalent entry exists. I imagine there might be many categories of, or reasons for, deletion. Is there any way to programmatically look up the reason for deletion, or to find the equivalent entry if a new one has been made?
There is unfortunately no way to look up the reason for deletion. If a new entry has been linked to an obsolete entry, it will be found by an accession number query. However, if this is not the case, like in many deletions from the unreviewed TrEMBL section, the only way of finding a new corresponding entry is unfortunately a sequence similarity search.
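Given that, one way to handle the network is to split the work in two: swap in a current accession wherever a replacement is known, and drop the edges that still touch a deleted entry. Below is a minimal sketch of that bookkeeping, not an established method. The function name `resolve_deleted`, the `deleted` set and the `replacements` mapping are all hypothetical; you would have to build the two inputs yourself from accession queries and similarity searches. The identifiers in the example are made-up placeholders, not real accessions.

```python
def resolve_deleted(edges, deleted, replacements):
    """Rewrite or drop edges that touch deleted accessions.

    edges        -- iterable of (source, interaction_type, target) tuples
    deleted      -- set of accessions known to be deleted (assumed input)
    replacements -- dict mapping an obsolete accession to a current one
                    (assumed input, e.g. collected by hand)
    """
    kept, dropped = [], []
    for source, kind, target in edges:
        # Substitute the current accession wherever a replacement is known.
        new_source = replacements.get(source, source)
        new_target = replacements.get(target, target)
        if new_source in deleted or new_target in deleted:
            # No usable replacement: record the original edge as dropped.
            dropped.append((source, kind, target))
        else:
            kept.append((new_source, kind, new_target))
    return kept, dropped


# Placeholder identifiers, purely for illustration.
example_edges = [
    ("KEEP1", "pp", "KEEP2"),
    ("KEEP1", "pp", "GONE1"),
    ("OLD1", "pp", "KEEP2"),
]
kept, dropped = resolve_deleted(example_edges, {"GONE1"}, {"OLD1": "NEW1"})
```

Keeping the dropped edges in a separate list rather than discarding them makes it easy to review them later, for example if a similarity search turns up a corresponding entry.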
I just downloaded the human PIN set. The first column contains mainly identifiers for human UniProtKB entries, although there are also a few human viruses. The second column contains proteins from various organisms. (I extracted the UniProtKB identifiers from the first and second column, submitted them to a "batch retrieval" on uniprot.org, then viewed the corresponding UniProtKB entries by taxonomy.)
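The extraction step can be scripted roughly as follows. This is only a sketch, and it assumes the file follows the usual Cytoscape SIF layout, where each line holds a source node, an interaction type, and then one or more target nodes separated by whitespace. Under that assumption the two identifier columns are the first and last fields of each line. The function name `sif_identifiers` is my own, and I have not verified the layout of every line in the PINA file.

```python
def sif_identifiers(lines):
    """Collect source and target identifiers from SIF-formatted lines.

    Assumes each line is: source, interaction type, then one or more
    targets, separated by whitespace.
    """
    sources, targets = set(), set()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            # Skip blank lines and nodes listed without any interaction.
            continue
        sources.add(fields[0])
        # fields[1] is the interaction type; everything after it is a target.
        targets.update(fields[2:])
    return sources, targets


def load_sif(path):
    """Read a SIF file from disk and return (sources, targets)."""
    with open(path, encoding="utf-8") as handle:
        return sif_identifiers(handle)
```

The sorted union of the two sets, one identifier per line, can then be pasted into the batch retrieval form on uniprot.org.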