7.8 years ago
jacobsen.jeremy ▴ 40
A large proportion of the UniProt database is not linked to a RefSeq nucleotide ID, for instance Q7KZI7-11, Q7KZI7-13, Q496A3-2, Q8NDM7-3, Q8NDM7-2 and Q8NDM7-5. I've counted about 10,000 of these. Why is this, and is there a way to patch this?
Looks like they are isoforms that differ from the canonical sequence. If you drop the (-x) part you will get the original sequence ID, which is linked to a RefSeq entry.
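A minimal sketch of dropping the (-x) part in Python, in case it helps. The function name `canonical_accession` and the sample list are made up for illustration, and it assumes the isoform suffix is always a hyphen followed by digits at the end of the accession:

```python
import re

# Assumption: isoform accessions look like "<accession>-<number>".
ISOFORM_SUFFIX = re.compile(r"-\d+$")

def canonical_accession(accession: str) -> str:
    """Return the accession with any trailing isoform suffix removed."""
    return ISOFORM_SUFFIX.sub("", accession.strip())

# Example accessions taken from this thread.
accessions = ["Q7KZI7-11", "Q7KZI7-13", "Q496A3-2",
              "Q8NDM7-3", "Q8NDM7-2", "Q8NDM7-5", "O71037"]

# Collapse isoforms onto their canonical accession and de-duplicate.
canonical = sorted({canonical_accession(a) for a in accessions})
print(canonical)
```

The de-duplicated canonical accessions can then be run through whatever ID mapping you were already using. Keep in mind that the RefSeq entry you get back corresponds to the canonical sequence, not necessarily to the specific isoform.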
Thanks genomax. This is definitely the case for most of the missed mappings, but there are still many others that are not isoform accessions, such as O71037, Q9UKH3 and Q6ZUT4. These may account for 500 or so entries (much better than 10,000 anyway).
First two entries appear to be some sort of retro-viral proteins and the last one is based on a single mRNA sequence. Not enough evidence for RefSeq curators to act on. Looks like you may have to exclude these entries from whatever analysis you are doing.
You are correct. After taking a closer look, it would appear that many of the remaining entries are either contaminants or "putative uncharacterized proteins". There are others that have an ENST identifier but no RefSeq mapping. I'm at a point now where I've been able to map 98% in one way or another, and I suspect that more than half of the remaining 1000 (not 500) entries are non-human contaminants. This is good enough that I can attempt direct sequence matching or annotate by hand. Thanks a lot genomax!