Question

Getting Coding Strand (CDS) Using UniProt IDs

13.5 years ago

Sam ▴ 70

Hi,

I have a file in the following format:

1    tag    0.108    1    11    B7LTF9    P05100
2    alkA    0.046    2    11    B7LV32    P04395
3    gnd    0.011    2    11    B7LUG0    P00350
4    pgl    0.048    1    11    B7LJZ2    P52697
5    aaeA    0.061    3    11    B7LRL6    P46482
6    aaeB    0.069    3    11    B7LRL5    P46481
...

The last two columns are the UniProt IDs from two different species (Escherichia fergusonii ATCC 35469 and Escherichia coli K-12, respectively). Using those UniProt IDs, I need the nucleotide CDS. I have code to parse the file and write the UniProt IDs of each species to separate files. However, I can't figure out how to get the CDS. I have tried BioMart to retrieve the sequences from EMBL bacteria, but it does not have a complete mapping of UniProt IDs to EMBL bacteria IDs. Please suggest any other way I can accomplish this.

Thank you very much.

cds uniprot id mapping bioperl • 4.6k views

updated 13.5 years ago by Elisabeth Gasteiger ★ 2.4k • written 13.5 years ago by Sam ▴ 70

score 3 · Answer 1 · 2012-01-18

There are a few solutions to this problem. Since you are starting from UniProt IDs, it's simplest to use the tools at the UniProt website.

On the UniProt website, click the "ID Mapping" tab at the top of the page. Then either paste your UniProt IDs or upload a file containing them, one per line. Set From to "UniProtKB AC/ID" and To to "EMBL/GenBank/DDBJ CDS", then click "Map", which returns, for example:

From    To
B7LTF9    CAQ91018.1
P05100    AAA24658.1
P05100    CAA27472.1
P05100    AAB18526.1
P05100    AAC76573.1
P05100    BAE77746.1
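
As a rough sketch of the parsing step, the snippet below groups the mapped CDS protein IDs by UniProt accession. It assumes the mapping result was saved as tab- or whitespace-separated text with a "From To" header row, which is an assumption about the download format; parse_mapping is a hypothetical helper name, not part of any UniProt tool.

```python
from collections import defaultdict


def parse_mapping(text):
    """Group CDS protein IDs by UniProt accession from From/To lines."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, malformed lines, and the "From  To" header row.
        if len(fields) != 2 or fields[0] == "From":
            continue
        accession, cds_id = fields
        mapping[accession].append(cds_id)
    return dict(mapping)


# Example input copied from the first rows of the mapping result above.
example = "From\tTo\nB7LTF9\tCAQ91018.1\nP05100\tAAA24658.1\nP05100\tCAA27472.1\n"
mapping = parse_mapping(example)
for accession, cds_ids in mapping.items():
    print(accession, ",".join(cds_ids))
```

Keeping a list per accession matters because one UniProt entry can map to several CDS records, as P05100 does here.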

You can then parse that file and use a tool to retrieve the CDS for each EMBL ID. For example, BioPerl includes a tool named bp_fetch, which works like this:

bp_fetch net::embl:AAA24658.1

Result:

>AAA24658 Escherichia coli hypothetical protein
ATGGAACGTTGCGGCTGGGTGAGTCAGGACCCGCTTTATATTGCCTACCATGATAATGAG
TGGGGCGTGCCTGAAACTGACAGTAAAAAACTGTTCGAAATGATCTGCCTTGAAGGGCAG
CAGGCTGGATTATCGTGGATCACCGTCCTCAAAAAACGCGAAAACTATCGCGCCTGCTTT
CATCAGTTCGATCCGGTGAAGGTCGCAGCAATGCAGGAAGAGGATGTCGAAAGACTGGTA
CAGGACGCCGGGATTATCCGCCATCGAGGGAAAATTCAGGCAATTATTGGTAATGCGCGG
GCGTACCTGCAAATGGAACAGAACGGCGAACCGTTTGTCGACTTTGTCTGGTCGTTTGTA
AATCATCAGCCACAGGTGACACAAGCCACAACGTTGAGCGAAATTCCCACATCTACGTCC
GCCTCCGACGCCCTATCTAAGGCACTGAAAAAACGTGGTTTTAAGTTTGTCGGCACCACA
ATCTGTTACTCCTTTATGCAGGCATGTGGGCTGGTGAATGATCATGTGGTTGGCTGCTGT
TGCTATCCGGGAAATAAACCATGA
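
To repeat this for many IDs, one option is to call bp_fetch once per ID from a small script. The following is only a sketch: it assumes bp_fetch is installed and on your PATH, that the remote database still answers requests in the net::embl: form shown above, and that the output is FASTA as in the example. The names bp_fetch_command and fetch_cds are hypothetical helpers introduced here for illustration.

```python
import subprocess


def bp_fetch_command(cds_id, database="embl"):
    """Build the bp_fetch argument list for one ID, mirroring the command above."""
    return ["bp_fetch", f"net::{database}:{cds_id}"]


def fetch_cds(cds_ids, out_path):
    """Run bp_fetch once per ID and write every returned record to out_path.

    Returns the list of IDs for which bp_fetch reported a failure.
    """
    failed = []
    with open(out_path, "w") as out:
        for cds_id in cds_ids:
            result = subprocess.run(
                bp_fetch_command(cds_id), capture_output=True, text=True
            )
            if result.returncode != 0:
                # Record the failure and continue with the remaining IDs.
                failed.append(cds_id)
                continue
            out.write(result.stdout)
    return failed


# Usage (needs BioPerl's bp_fetch and network access, so it is left commented out):
# failed = fetch_cds(["AAA24658.1", "CAQ91018.1"], "cds.fasta")
```

Collecting the failed IDs instead of stopping at the first error makes it easier to retry only the records that could not be retrieved.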

score 2 · Answer 2 · 2012-01-19

Neilfws has already described the ID mapping tool at uniprot.org.

I would just like to point out that in the vast majority of cases, there is no single nucleic acid reference sequence for a given UniProtKB/Swiss-Prot protein sequence.

The canonical protein sequence is the outcome of thorough curation work, which often involves merging various sequences encoded by the same gene (in one species). In the annotation process, the most correct amino acid sequences are chosen, and discrepancies are analyzed and documented.

See this FAQ for more details: http://www.uniprot.org/faq/35