Question

Extracting locus index from UniProt with high confidence

8.8 years ago

olle.nordesjo • 0

I'm automating the assembly of paired multiple sequence alignments (i.e. MSAs in which two proteins are concatenated one after the other) so that I can do some paired sequence processing on them.

To do this, I'm querying two protein families from Pfam and trying to associate their members with each other.

My understanding is that, in my case, showing that the genes for the two proteins are adjacent in the genome is enough to associate them with high confidence (since they are located in the same operon).

So, this is the question:

Given the XML information in UniProt, what is the best way to establish genomic proximity/adjacency? (Can I find it easily using the Biopython API, for example?)

If the specific locus index information is missing (I suspect this is often the case), is it appropriate to compare the UniProt identifiers (e.g. K0D1W6 in http://www.uniprot.org/uniprot/K0D1W6.xml) for similarity using some measure?

Thanks!

I hope some intelligent mind out there can help me. I'd be forever grateful!

uniprot loci genome API • 2.0k views

updated 17 months ago by Ram • written 8.8 years ago by olle.nordesjo

Ram · Accepted Answer · 2015-07-20

8.8 years ago

Elisabeth Gasteiger ★ 2.4k

The letters in UniProtKB accession numbers have absolutely no meaning. The role of an accession number is to uniquely identify an entry, and that's all. Have a look at the user manual section about accession numbers: http://www.uniprot.org/manual/accession_numbers

You may, however, use the ordered locus names in the gene name field to assert genomic proximity:

We call 'Ordered locus name' (OLN) the naming systems that are used to sequentially assign an identifier to each predicted gene of a completely sequenced genome or chromosome. The OLN is generally based on a prefix representing the organism followed by a number which usually represents the sequential ordering of genes on the chromosome. Depending on the genome-sequencing center, OLNs are only attributed to protein-coding genes, or also to pseudogenes, and also to tRNA-coding genes and others. If two predicted genes have been merged to form a new gene, both OLNs are indicated, separated by a slash.

See http://www.uniprot.org/help/gene_name
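As a rough illustration of the "prefix followed by a number" idea described above, here is a minimal Python sketch that splits an OLN into its prefix and trailing number and measures how far apart two OLNs are. The function names `parse_oln` and `locus_distance` are made up for this example, the regular expression is only an assumption about what a typical OLN looks like, and the second locus name in the usage lines is hypothetical. How large a gap still counts as "adjacent" depends on how the sequencing center numbered its genes, so that threshold is left to the caller.

```python
import re

# Assumed shape of an OLN: any prefix followed by a trailing run of digits,
# e.g. "AMBLS11_16600". Names that do not end in digits are rejected.
_OLN_RE = re.compile(r"^(?P<prefix>.*?)(?P<number>\d+)$")


def parse_oln(oln):
    """Split an OLN field into (prefix, number) pairs.

    Merged genes carry several OLNs separated by a slash, so a list is returned.
    """
    parsed = []
    for part in oln.split("/"):
        match = _OLN_RE.match(part.strip())
        if match is None:
            raise ValueError(f"no trailing number in {part!r}")
        parsed.append((match.group("prefix"), int(match.group("number"))))
    return parsed


def locus_distance(oln_a, oln_b):
    """Smallest numeric gap between two OLN fields that share a prefix.

    Returns None when no prefix is shared (different naming series).
    """
    gaps = [
        abs(num_a - num_b)
        for prefix_a, num_a in parse_oln(oln_a)
        for prefix_b, num_b in parse_oln(oln_b)
        if prefix_a == prefix_b
    ]
    return min(gaps) if gaps else None


# "AMBLS11_16605" is a hypothetical neighbour, used only for illustration.
print(parse_oln("AMBLS11_16600"))
print(locus_distance("AMBLS11_16600", "AMBLS11_16605"))
```

A small distance only suggests proximity if the numbering really follows gene order on the chromosome, which the quoted help text says is usually, not always, the case.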

In your example:

<gene>
  <name type="ordered locus" evidence="10">AMBLS11_16600</name>
</gene>
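
For what it's worth, a name like this can be pulled out of the entry XML with the Python standard library alone. The sketch below is only an illustration: `SAMPLE_XML` is a trimmed-down stand-in for the real K0D1W6 entry, the namespace shown in it is an assumption about the UniProt XML format, and `gene_names_by_type` is a made-up helper name. The code compares local tag names only, so it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for a UniProt entry; the real file has many more fields.
# The xmlns value is an assumption and is not relied upon by the code below.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>K0D1W6</accession>
    <gene>
      <name type="ordered locus" evidence="10">AMBLS11_16600</name>
    </gene>
  </entry>
</uniprot>"""


def gene_names_by_type(xml_text, name_type="ordered locus"):
    """Return the text of every <name> element whose type attribute matches."""
    root = ET.fromstring(xml_text)
    names = []
    for elem in root.iter():
        # Drop any "{namespace}" part so only the local tag name is compared.
        local_tag = elem.tag.rsplit("}", 1)[-1]
        if local_tag == "name" and elem.get("type") == name_type:
            names.append((elem.text or "").strip())
    return names


print(gene_names_by_type(SAMPLE_XML))         # ordered locus names
print(gene_names_by_type(SAMPLE_XML, "ORF"))  # ORF names, if any
```

Passing a different `name_type` lets you see which kind of gene name an entry actually carries before relying on it for ordering.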

updated 17 months ago by Ram • written 8.8 years ago by Elisabeth Gasteiger

Thanks a lot!

That helps, especially in the case of merged genes. However, is there a standard for exactly what an ordered locus name should look like? I ask because I keep stumbling upon really odd-looking OLNs, for example:

POPTR_0007s09720g
POPTR_0010s11350g

Both come from the same organism, which seems slightly messy.

updated 17 months ago by Ram • written 8.8 years ago by olle.nordesjo

The identifiers you saw appear to be ORF names in UniProtKB, not OLNs, which suggests that they are not really ordered:

e.g. http://www.uniprot.org/uniprot/U5G7Y3

OLNs are included in UniProtKB only if they were attributed by the group that sequenced the genome.