Question: How to download UniProt files using Python and XML
Good Gravy (United Kingdom) wrote, 4.5 years ago:

This question follows from another question, "How To Retrive A Batch Of Transmembrane Domains From Uniprot?", which asks how to retrieve transmembrane (TM) domains from UniProt.

The top answer in that question mentions that the UniProt XML format can be used to retrieve a fasta sequence of each TM region.

 

filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
input_format = "uniprot-xml"
feature_type = "transmembrane region"
output_filename = "uniprot_tm.fasta"

 

How can this XML snippet be used as part of a Python script to download FASTA-formatted files from UniProt? I am looking for a solution that does not use Biopython modules (see "How To Retrive A Batch Of Transmembrane Domains From Uniprot?") or Java modules (see "How To Retrieve Human Proteins Sequence Containing A Given Domain"), as those have already been solved.

Nikhil Chaudhary (India) wrote, 4.5 years ago:

Assuming you know a decent bit of Python (I don't!), you can read the filenames line, split it on the quotes (") or commas (,), and extract the UniProt IDs (O43561, P08195 and so on). The URL for each ID's FASTA file has the form "http://www.uniprot.org/uniprot/P08195.fasta?include=yes", where you substitute your own ID. Search for a simple Python script that downloads files by URL using urllib. Then feed your UniProt IDs one by one into the downloader script and save the FASTA files as you wish.

Hope that answers your question. This method might be slow and non-standard, but it is just what I would have used.
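
A minimal sketch of the steps above in Python 3, assuming the URL pattern quoted in this answer still resolves (the site may have changed its addresses since then). The helper names accession_from_filename, fasta_url and download_fasta are made up for illustration; urllib.request.urlretrieve is the standard-library call.

```python
import os
import urllib.request

# URL pattern quoted in this answer; assumed, not checked against the live site.
FASTA_URL = "http://www.uniprot.org/uniprot/{acc}.fasta?include=yes"


def accession_from_filename(filename):
    """Turn a name such as 'O43561.xml' into the accession 'O43561'."""
    return os.path.splitext(os.path.basename(filename))[0]


def fasta_url(accession):
    """Build the FASTA URL for one accession."""
    return FASTA_URL.format(acc=accession)


def download_fasta(accession, out_dir="."):
    """Download one entry and return the local path (needs network access)."""
    path = os.path.join(out_dir, accession + ".fasta")
    urllib.request.urlretrieve(fasta_url(accession), path)
    return path


# The filenames list from the question.
filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
accessions = [accession_from_filename(name) for name in filenames]

for acc in accessions:
    print(acc, fasta_url(acc))
    # download_fasta(acc)  # uncomment to actually fetch the file
```

As noted in the replies, this fetches the sequence of the entire protein, not just the TM regions.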


The problem still remains how to only get the TM domain. This method does indeed fetch the fasta sequences, but of the entire protein.

Reply written 4.5 years ago by Good Gravy

The BeautifulSoup library in Python can parse HTML pages, so you can get the coordinates of the TM domain from the entry page. Could you get the coordinates and then cut that region out of your protein sequence?

Reply written 4.5 years ago by Bioinformatics_NewComer

I have a Perl script that can parse the TM domain out. But for technical reasons I want to be able to get the TM domains directly from UniProt. The answers mentioned in the question show this is possible with both Biopython and Java; I am looking for a way to do this in Python alone.

Reply written 4.5 years ago by Good Gravy

I tried to find a way to download only the TM region, but I couldn't. In that case, I guess a good way has already been suggested by Bioinformatics_NewComer: parse the HTML page to get the coordinates of the transmembrane region, then cut those regions from the full sequences. That is all I can think of. Do post here if you find a way to do exactly what you want.
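
The cutting step could look roughly like the sketch below. It assumes the coordinates are 1-based and inclusive, which is how sequence feature positions are usually reported. The sequence, the coordinates and the helper names cut_regions and to_fasta are made up for illustration.

```python
def cut_regions(sequence, regions):
    """Return the subsequences for 1-based, inclusive (start, end) pairs."""
    return [sequence[start - 1:end] for start, end in regions]


def to_fasta(accession, sequence, regions):
    """Format each cut region as one FASTA record."""
    records = []
    for (start, end), fragment in zip(regions, cut_regions(sequence, regions)):
        records.append(">{}|TM|{}-{}\n{}\n".format(accession, start, end, fragment))
    return "".join(records)


# Made-up sequence and coordinates, for illustration only.
sequence = "MKTAYIAKQRLLLVVAAGGILLSTPEE"
regions = [(11, 21)]

fasta_text = to_fasta("EXAMPLE", sequence, regions)
print(fasta_text, end="")
```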

Reply written 4.5 years ago by Nikhil Chaudhary

Will do, thanks for the help!

Reply written 4.5 years ago by Good Gravy

Bioinformatics_NewComer (Genomic Island) wrote, 4.5 years ago:

I shall try to help you with this. In Python,

urllib.urlretrieve

works like wget (in Python 3 the same function is urllib.request.urlretrieve). So if you have UniProt accessions, you can use it to fetch the files.

You can build the URL for each accession following the pattern http://www.uniprot.org/uniprot/P02185.fasta

This will download the FASTA file.

For parsing XML, Python has dedicated libraries, such as xml.etree.ElementTree in the standard library.
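
As a rough sketch of that last point, the standard-library xml.etree.ElementTree module can pull the "transmembrane region" features out of a UniProt XML entry and cut them from the sequence, with no Biopython needed. It assumes the XML uses the http://uniprot.org/uniprot namespace and stores coordinates in feature/location/begin and feature/location/end elements with a position attribute; check this against a real file such as O43561.xml. The embedded sample entry and the function name tm_fasta are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the UniProt XML format.
NS = {"up": "http://uniprot.org/uniprot"}

# Tiny made-up entry mimicking the assumed layout; not real UniProt data.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <feature type="transmembrane region" description="Helical">
      <location>
        <begin position="4"/>
        <end position="9"/>
      </location>
    </feature>
    <feature type="chain">
      <location><begin position="1"/><end position="12"/></location>
    </feature>
    <sequence length="12">MKTLLVVAAGSE</sequence>
  </entry>
</uniprot>"""


def tm_fasta(xml_text, feature_type="transmembrane region"):
    """Return FASTA text with one record per matching feature."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("up:entry", NS):
        accession = entry.find("up:accession", NS).text
        # Drop any line breaks or spaces inside the sequence text.
        sequence = "".join(entry.find("up:sequence", NS).text.split())
        for feature in entry.findall("up:feature", NS):
            if feature.get("type") != feature_type:
                continue
            begin = feature.find("up:location/up:begin", NS)
            end = feature.find("up:location/up:end", NS)
            # Skip features without exact begin/end positions.
            if begin is None or end is None:
                continue
            if begin.get("position") is None or end.get("position") is None:
                continue
            start, stop = int(begin.get("position")), int(end.get("position"))
            # Positions are treated as 1-based and inclusive.
            fragment = sequence[start - 1:stop]
            records.append(">{}/{}-{}\n{}\n".format(accession, start, stop, fragment))
    return "".join(records)


print(tm_fasta(SAMPLE_XML), end="")
```

For real entries, the XML text could be read from a downloaded file and the result written to an output file such as uniprot_tm.fasta from the question.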
