Question

How to get specific amino acid sequences from a list of UniProt IDs

8.0 years ago

n00bgenome ▴ 40

Hi all,

I've done all of the exercises on Codecademy for Python, and I just downloaded Anaconda. If anyone can point me in the right direction to do this, I'd be very grateful.

(1) I have all of these UniProt IDs: http://www.genome.jp/dbget-bin/get_linkdb?-t+9+ko:K02405

(2) If you click on any of them, you'll get something like this: http://www.genome.jp/dbget-bin/www_bget?uniprot:A0A023NVK1

(3) That page lists identifiers:

OS   Dyella jiangningensis.
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales;
OC   Rhodanobacteraceae; Dyella.

and, at the bottom, it has the sequence and its features.

(4)

FT   REGION       18     90       Sigma-70 factor domain-2.
FT                                {ECO:0000256|HAMAP-Rule:MF_00962}.
FT   REGION       98    170       Sigma-70 factor domain-3.
FT                                {ECO:0000256|HAMAP-Rule:MF_00962}.
FT   REGION      186    234       Sigma-70 factor domain-4.
FT                                {ECO:0000256|HAMAP-Rule:MF_00962}.
FT   MOTIF        45     48       Interaction with polymerase core subunit
FT                                RpoC. {ECO:0000256|HAMAP-Rule:MF_00962}.
SQ   SEQUENCE   254 AA;  28015 MW;  F3BD706CB822684E CRC64;
     MSVASEYLQL QRQSADELVR QHAPLVRRIA YHLMGRLPPS VDVSDLIQAG MIGLLEAARN
     FATGRNASFE TFAGIRIRGA MLDELRRTDW TPRSVHRKVR EMAEVVRQIE IETGADADDA
     EVMRRLGIGA EEYHQVLADA ASARLLSLSA PDDADGGAAF DVADGDSLGP QDSVEHEGMR
     EALVEAIGSL PEREQLVMSL YYEEELNLKE IGAVLGVTES RVCQIHGQAV VRLRARMSGW
     HDAVEQSQKQ KKKG

The lines that say "Sigma-70 factor domain-2" and "Sigma-70 factor domain-4" give the residue ranges these domains cover, in this case 18-90 and 186-234, respectively. The sequence those ranges refer to is at the bottom, starting from "MSVASE....".
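To make the coordinate-to-sequence step concrete, here is a minimal sketch in Python. It assumes the FT coordinates are 1-based and inclusive, which is how UniProt numbers residues; the function name `extract_region` is just a hypothetical helper, not part of any library.

```python
def extract_region(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a protein sequence.

    Whitespace in the input is ignored, so the sequence can be pasted
    straight from the SQ block of a flat-file record.
    """
    residues = "".join(sequence.split())
    if not 1 <= start <= end <= len(residues):
        raise ValueError(f"range {start}-{end} is outside 1-{len(residues)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return residues[start - 1:end]


# Sequence of A0A023NVK1, copied from the record in the question.
SEQ = """
MSVASEYLQL QRQSADELVR QHAPLVRRIA YHLMGRLPPS VDVSDLIQAG MIGLLEAARN
FATGRNASFE TFAGIRIRGA MLDELRRTDW TPRSVHRKVR EMAEVVRQIE IETGADADDA
EVMRRLGIGA EEYHQVLADA ASARLLSLSA PDDADGGAAF DVADGDSLGP QDSVEHEGMR
EALVEAIGSL PEREQLVMSL YYEEELNLKE IGAVLGVTES RVCQIHGQAV VRLRARMSGW
HDAVEQSQKQ KKKG
"""

if __name__ == "__main__":
    for start, end in [(18, 90), (186, 234)]:
        print(start, end, extract_region(SEQ, start, end))
```

With 1-based inclusive coordinates, region 18-90 begins at the L in position 18 and region 186-234 begins at the A in position 186, so it is worth checking the first residue of each slice against the record.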

What I want to do is take all the UniProt IDs (1) and, for each UniProt ID (2), get the identifiers (3) together with the amino acid sequences of the specified regions (4).

So in the above case, it would use the Uniprot ID to spit out the following information:

UniProt ID: A0A023NVK1
Species: Dyella jiangningensis.
Taxonomy: Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Rhodanobacteraceae; Dyella.

Sequence: LVR QHAPLVRRIA YHLMGRLPPS VDVSDLIQAG MIGLLEAARN FATGRNASFE TFAGIRIRGA MLDELRRTDW (and) AIGSL PEREQLVMSL YYEEELNLKE IGAVLGVTES RVCQIHGQAV VRLR

So how do I get started?

alignment uniprot • 2.2k views

score 3 · Accepted Answer · 2016-05-03

I recommend that you access UniProt at http://www.uniprot.org instead of the secondary source at genome.jp. You can access the website programmatically (see http://www.uniprot.org/help/programmatic_access). Since you are interested in several fields, I recommend that you look at the column customization functionality first, and then use the resulting URLs to implement the procedure programmatically:

http://www.uniprot.org/help/customize
http://insideuniprot.blogspot.ch/2015_03_01_archive.html
http://www.uniprot.org/help/uniprotkb_column_names
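As a rough starting point, programmatic access could look something like the sketch below. It is only a sketch under assumptions that are not from this thread: it uses the per-entry flat-file URL of the current UniProt REST service (`https://rest.uniprot.org/uniprotkb/<accession>.txt`) rather than the column-customization URLs, and the helper names (`entry_url`, `fetch_entry`, `parse_entry`) are hypothetical. The parser reads only the OS, OC, FT REGION and SQ lines, and the regular expression is written to accept both the older `18     90` and the newer `18..90` coordinate layouts.

```python
import re
import urllib.request

# Assumption: per-entry flat-file endpoint of the UniProt REST service.
BASE_URL = "https://rest.uniprot.org/uniprotkb/{acc}.txt"

# Matches "FT   REGION   18   90 ..." and "FT   REGION   18..90".
REGION_RE = re.compile(r"FT\s+REGION\s+(\d+)(?:\.\.|\s+)(\d+)")


def entry_url(accession: str) -> str:
    """Build the flat-file URL for one accession."""
    return BASE_URL.format(acc=accession)


def fetch_entry(accession: str) -> str:
    """Download one record as text (needs network access)."""
    with urllib.request.urlopen(entry_url(accession)) as response:
        return response.read().decode("utf-8")


def parse_entry(text: str) -> dict:
    """Pull species, taxonomy, REGION coordinates and sequence from a record."""
    species, taxonomy, regions, seq_parts = [], [], [], []
    in_sequence = False
    for line in text.splitlines():
        code, body = line[:2], line[5:].strip()
        if code == "OS":
            species.append(body)
        elif code == "OC":
            taxonomy.append(body)
        elif code == "FT":
            match = REGION_RE.match(line)
            if match:
                regions.append((int(match.group(1)), int(match.group(2))))
        elif code == "SQ":
            in_sequence = True       # sequence lines follow the SQ header
        elif code == "//":
            in_sequence = False      # end of record
        elif in_sequence:
            seq_parts.append(line.replace(" ", ""))
    return {
        "species": " ".join(species),
        "taxonomy": " ".join(taxonomy),
        "regions": regions,
        "sequence": "".join(seq_parts),
    }


if __name__ == "__main__":
    record = parse_entry(fetch_entry("A0A023NVK1"))
    print(record["species"])
    print(record["taxonomy"])
    for start, end in record["regions"]:
        # Coordinates are 1-based and inclusive.
        print(start, end, record["sequence"][start - 1:end])
```

For a long list of IDs, you would loop over the accessions and call `fetch_entry` for each, ideally with a short pause between requests. An established parser such as Biopython's `Bio.SwissProt` module may be a more robust choice than hand-written line parsing.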

Please don't hesitate to contact the UniProt helpdesk if you have any additional questions.

PS Maybe you should add the tag "uniprot" to your question?