Getting Coding Sequence (CDS) Using UniProt IDs
12.3 years ago
Sam ▴ 70

Hi,

I have a file in the following format:

1    tag    0.108    1    11    B7LTF9    P05100
2    alkA    0.046    2    11    B7LV32    P04395
3    gnd    0.011    2    11    B7LUG0    P00350
4    pgl    0.048    1    11    B7LJZ2    P52697
5    aaeA    0.061    3    11    B7LRL6    P46482
6    aaeB    0.069    3    11    B7LRL5    P46481
...

The last two columns are the UniProt IDs from two species (Escherichia fergusonii ATCC 35469 and Escherichia coli K-12, respectively). Using those UniProt IDs, I need the nucleotide CDS. I have code to parse the file and write the UniProt IDs of each species to separate files. However, I can't figure out how to get the CDS. I have tried BioMart to retrieve the sequences from EMBL bacteria, but it does not have a complete mapping of UniProt IDs to EMBL bacteria IDs. Please suggest any other way I can accomplish this.

Thank you very much.

cds uniprot id mapping bioperl • 4.1k views
12.3 years ago
Neilfws 49k

There are a few solutions to this problem. Since you are starting from UniProt IDs, it's simplest to use the tools at the UniProt website.

On that site, click the "ID Mapping" tab at the top of the page. Then either paste in your UniProt IDs or upload a file containing them, one per line. Set From = UniProtKB AC/ID and To = EMBL/GenBank/DDBJ CDS, then click "Map", which returns, for example:

From      To
B7LTF9    CAQ91018.1
P05100    AAA24658.1
P05100    CAA27472.1
P05100    AAB18526.1
P05100    AAC76573.1
P05100    BAE77746.1

You can then parse that file and use a tool to return the CDS for a given EMBL ID. For example, BioPerl includes a tool named bp_fetch, which works like this:

bp_fetch net::embl:AAA24658.1

Result:

>AAA24658 Escherichia coli hypothetical protein
ATGGAACGTTGCGGCTGGGTGAGTCAGGACCCGCTTTATATTGCCTACCATGATAATGAG
TGGGGCGTGCCTGAAACTGACAGTAAAAAACTGTTCGAAATGATCTGCCTTGAAGGGCAG
CAGGCTGGATTATCGTGGATCACCGTCCTCAAAAAACGCGAAAACTATCGCGCCTGCTTT
CATCAGTTCGATCCGGTGAAGGTCGCAGCAATGCAGGAAGAGGATGTCGAAAGACTGGTA
CAGGACGCCGGGATTATCCGCCATCGAGGGAAAATTCAGGCAATTATTGGTAATGCGCGG
GCGTACCTGCAAATGGAACAGAACGGCGAACCGTTTGTCGACTTTGTCTGGTCGTTTGTA
AATCATCAGCCACAGGTGACACAAGCCACAACGTTGAGCGAAATTCCCACATCTACGTCC
GCCTCCGACGCCCTATCTAAGGCACTGAAAAAACGTGGTTTTAAGTTTGTCGGCACCACA
ATCTGTTACTCCTTTATGCAGGCATGTGGGCTGGTGAATGATCATGTGGTTGGCTGCTGT
TGCTATCCGGGAAATAAACCATGA
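
Parsing the mapping table and calling bp_fetch for every ID can be scripted. Here is a minimal Python sketch, assuming the ID Mapping result is saved as tab- or space-separated text with a From/To header row. The function names parse_mapping and bp_fetch_commands are made up for illustration; the generated commands simply reuse the bp_fetch syntax shown above.

```python
from collections import defaultdict


def parse_mapping(lines):
    """Group CDS IDs by UniProt ID from 'From To' rows."""
    mapping = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines and the "From To" header row.
        if len(fields) != 2 or fields == ["From", "To"]:
            continue
        uniprot_id, cds_id = fields
        mapping[uniprot_id].append(cds_id)
    return dict(mapping)


def bp_fetch_commands(mapping):
    """Build one bp_fetch command line per mapped CDS ID."""
    return [
        f"bp_fetch net::embl:{cds_id}"
        for cds_ids in mapping.values()
        for cds_id in cds_ids
    ]


# Example input: the first rows of the mapping table from this answer.
example = """From\tTo
B7LTF9\tCAQ91018.1
P05100\tAAA24658.1
P05100\tCAA27472.1
""".splitlines()

mapping = parse_mapping(example)
commands = bp_fetch_commands(mapping)
for command in commands:
    print(command)
```

With a real file you would pass open("mapping.tab") (a hypothetical filename) instead of the example lines, and run the printed commands in a shell.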

Thank you for your answer. It is very helpful. I tried the UniProt ID mapping before asking this question, and I am getting multiple IDs for one UniProt ID, just as you are. How do I know which one is the correct one?


It's not really a case of which is "correct". Any of the EMBL sequences could be relevant. You would have to do some further analysis of the returned sequences, e.g. how similar are they?
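
As a rough first pass at "how similar are they", you could check which of the returned CDS records are exact duplicates of each other. The Python sketch below assumes all fetched sequences have been concatenated into one FASTA text; the helper names and the toy records (cds_a, cds_b, cds_c) are illustrative only, not real entries. The percent identity helper is deliberately naive: it compares position by position and only applies to sequences of equal length, so anything with indels would need a proper alignment tool instead.

```python
def read_fasta(text):
    """Return {record_id: sequence} from FASTA-formatted text."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record ID.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {rid: "".join(parts) for rid, parts in records.items()}


def group_identical(records):
    """Group record IDs whose sequences are exactly identical."""
    groups = {}
    for rid, seq in records.items():
        groups.setdefault(seq, []).append(rid)
    return list(groups.values())


def percent_identity(a, b):
    """Position-wise identity; returns None unless lengths match."""
    if not a or len(a) != len(b):
        return None
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy records for illustration only.
fasta_text = """>cds_a toy record
ATGGAACGT
>cds_b toy record
ATGGAACGT
>cds_c toy record
ATGGAACGC
"""

records = read_fasta(fasta_text)
groups = group_identical(records)
identity = percent_identity(records["cds_a"], records["cds_c"])
print(groups)
print(identity)
```

If all the IDs for one UniProt entry fall into a single group, it does not matter which one you pick.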

12.3 years ago

Neilfws has already covered the ID mapping tool at uniprot.org.

I would just like to point out that in the vast majority of cases, there is no single nucleic acid reference sequence for a given UniProtKB/Swiss-Prot protein sequence.

The canonical protein sequence is the outcome of thorough curation work, which often involves merging various sequences encoded by the same gene (in one species). In the annotation process, the most correct amino acid sequences are chosen and discrepancies are analyzed and documented.

See this FAQ for more details: http://www.uniprot.org/faq/35
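
If you still need to pick one CDS per protein, one possible check is to translate each mapped CDS and see which ones reproduce the canonical protein sequence exactly. The Python sketch below is only an illustration under stated assumptions: it uses the standard codon assignments (which the bacterial table shares), it does not handle alternative start codons such as GTG or TTG, and the function names and toy sequences are invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS up to (not including) the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons with ambiguous bases are reported as "X".
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def matching_cds(cds_records, protein):
    """Return IDs of CDS records whose translation equals the protein."""
    return [
        rid for rid, cds in cds_records.items()
        if translate(cds) == protein
    ]


# Toy data for illustration only, not real database entries.
toy_cds = {
    "cds_1": "ATGGAACGTTGCGGCTGGTGA",
    "cds_2": "ATGGAACGTTGCGGCTGCTGA",
}
hits = matching_cds(toy_cds, "MERCGW")
print(hits)
```

A CDS that does not match exactly is not necessarily wrong; as described above, the canonical sequence may have been assembled from several submissions.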
