Extracting loci index from uniprot with high confidence
6.3 years ago

I'm in the process of automating an assembly of paired multiple sequence alignments (i.e. MSAs with two proteins aligned after each other) in order to do some paired sequence processing on them.

In order to do this, I'm querying two protein families from Pfam, and I'm trying to associate them.

As I understand it, showing that the genomic locations of the two proteins are adjacent is enough to associate them with high confidence in my case (since they're situated in the same operon).

So, this is the question:

Given the XML information in UniProt, what is the best way to assert the genomic proximity/adjacency? (Can I find it easily using the Biopython API, for example?)

In case the specific locus index information is missing (I suspect this is often the case), is it appropriate to compare the UniProt identifiers (e.g. K0D1W6 in http://www.uniprot.org/uniprot/K0D1W6.xml) for similarity using some measure?

 

Thanks!

 

Hope to get some intelligent mind out there to help me. I'd be forever grateful!

 

uniprot genome loci API • 1.3k views
6.2 years ago

The letters in UniProtKB accession numbers have absolutely no meaning. The role of an accession number is to uniquely identify an entry, and that's all. Have a look at the user manual section about accession numbers: http://www.uniprot.org/manual/accession_numbers

You may, however, use the ordered locus names in the gene name field to assert genomic proximity. From the UniProt help page on gene names:

"We call ‘Ordered locus name’ (OLN) the naming systems that are used to sequentially assign an identifier to each predicted gene of a completely sequenced genome or chromosome. The OLN is generally based on a prefix representing the organism followed by a number which usually represents the sequential ordering of genes on the chromosome. Depending on the genome-sequencing center, OLNs are only attributed to protein-coding genes, or also to pseudogenes, and also to tRNA-coding genes and others. If two predicted genes have been merged to form a new gene, both OLNs are indicated, separated by a slash."

See http://www.uniprot.org/help/gene_name

In your example:

<gene>
  <name type="ordered locus" evidence="10">AMBLS11_16600</name>
</gene>
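As a rough illustration of how this could be automated, here is a minimal sketch that pulls the ordered locus names out of a UniProt XML entry with Python's standard library and compares two of them. The function names, the `max_gap` threshold, and the second locus name `AMBLS11_16605` are all made up for the example. The sketch assumes an OLN is a prefix followed by a trailing number, and that merged genes appear slash-separated inside a single `<name>` element; neither is guaranteed for every genome. Numbering often advances in steps of 5 or 10, so "adjacent" here only means "same prefix and a small numeric gap".

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by UniProt XML entries.
NS = {"up": "http://uniprot.org/uniprot"}


def ordered_locus_names(xml_text):
    """Return all ordered locus names found in a UniProt XML document."""
    root = ET.fromstring(xml_text)
    names = []
    for name in root.iterfind(".//up:gene/up:name[@type='ordered locus']", NS):
        # Assumption: merged genes are written as "OLN1/OLN2" in one element.
        names.extend(part.strip() for part in (name.text or "").split("/"))
    return [n for n in names if n]


def split_oln(oln):
    """Split 'AMBLS11_16600' into ('AMBLS11_', 16600); None if no trailing number."""
    match = re.fullmatch(r"(.*?)(\d+)", oln)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def probably_adjacent(oln_a, oln_b, max_gap=10):
    """Heuristic: same prefix and a numeric difference of at most max_gap."""
    a, b = split_oln(oln_a), split_oln(oln_b)
    if a is None or b is None:
        return False
    return a[0] == b[0] and 0 < abs(a[1] - b[1]) <= max_gap


if __name__ == "__main__":
    # Hypothetical pair: the second name is invented for the example.
    print(probably_adjacent("AMBLS11_16600", "AMBLS11_16605"))
```

The right `max_gap` depends on how the sequencing center numbered the genome, so it is worth checking a few known neighbours in each organism before trusting the heuristic.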

 


Thanks a lot!

That helps, especially in the case of merged genes. However, is there a standard for exactly what an ordered locus name should look like? I ask because I keep stumbling upon really weird-looking OLNs, for example:

POPTR_0007s09720g

POPTR_0010s11350g

Both come from the same organism, which seems slightly messy.


The identifiers you saw appear to be ORF names in UniProtKB, not OLNs, which suggests that they are not really ordered:

e.g. http://www.uniprot.org/uniprot/U5G7Y3

OLNs are included in UniProtKB only if they were attributed by the group that sequenced the genome.
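To tell the two kinds of identifier apart programmatically, one option is to group the gene names of an entry by their `type` attribute and only trust entries that carry an `ordered locus` name. The sketch below is a minimal illustration using the standard library. The helper names are invented, and the sample XML is a hand-written fragment for the example, not the real content of U5G7Y3.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace used by UniProt XML entries, in ElementTree's {uri}tag form.
UP = "{http://uniprot.org/uniprot}"


def gene_names_by_type(xml_text):
    """Map each gene-name type (e.g. 'ordered locus', 'ORF') to its names."""
    root = ET.fromstring(xml_text)
    by_type = defaultdict(list)
    # Only look at <name> children of <gene>; <organism> also has <name> elements.
    for gene in root.iter(UP + "gene"):
        for name in gene.findall(UP + "name"):
            by_type[name.get("type")].append(name.text)
    return dict(by_type)


def has_ordered_locus(xml_text):
    """True if the entry carries at least one ordered locus name."""
    return "ordered locus" in gene_names_by_type(xml_text)


if __name__ == "__main__":
    # Hand-written sample fragment, for illustration only.
    sample = (
        '<uniprot xmlns="http://uniprot.org/uniprot"><entry><gene>'
        '<name type="ORF">POPTR_0007s09720g</name>'
        '</gene></entry></uniprot>'
    )
    print(gene_names_by_type(sample))
    print(has_ordered_locus(sample))
```

Entries for which `has_ordered_locus` returns `False` would then need another source of genomic coordinates, since ORF names are not guaranteed to reflect gene order.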


Thanks, that clarifies it!
