Question: How To Retrieve A Batch Of Transmembrane Domains From UniProt?
5.6 years ago by
kevinjspring20
United States
kevinjspring20 wrote:

I want to download a batch of sequence data from UniProt, but I only want the regions annotated as transmembrane. On the UniProt website I can open each individual entry and, under 'Sequence Annotation (Features)', retrieve only the specific part of the sequence I want. This is helpful, but I need to do it for many entries, so I am looking for a batch retrieval option. Any tips on how to download a batch of protein sequences that contain only a specific annotated region?

Example:

LAT, 4F2hc, and LAX are all integral, single-pass transmembrane proteins. The full FASTA sequences for these three proteins are:

>sp|O43561|LAT_HUMAN Linker for activation of T-cells family member 1 OS=Homo sapiens GN=LAT PE=1 SV=1
MEEAILVPCVLGLLLLPILAMLMALCVHCHRLPGSYDSTSSDSLYPRGIQFKRPHTVAPW
PPAYPPVTSYPPLSQPDLLPIPRSPQPLGGSHRTPSSRRDSDGANSVASYENEGASGIRG
AQAGWGVWGPSWTRLTPVSLPPEPACEDADEDEDDYHNPGYLVVLPDSTPATSTAAPSAP
ALSTPGIRDSAFSMESIDDYVNVPESGESAEASLDGSREYVNVSQELHPGAAKTEPAALS
SQEAEEVEEEGAPDYENLQELN
>sp|P08195|4F2_HUMAN 4F2 cell-surface antigen heavy chain OS=Homo sapiens GN=SLC3A2 PE=1 SV=3
MELQPPEASIAVVSIPRQLPGSHSEAGVQGLSAGDDSELGSHCVAQTGLELLASGDPLPS
ASQNAEMIETGSDCVTQAGLQLLASSDPPALASKNAEVTGTMSQDTEVDMKEVELNELEP
EKQPMNAASGAAMSLAGAEKNGLVKIKVAEDEAEAAAAAKFTGLSKEELLKVAGSPGWVR
TRWALLLLFWLGWLGMLAGAVVIIVRAPRCRELPAQKWWHTGALYRIGDLQAFQGHGAGN
LAGLKGRLDYLSSLKVKGLVLGPIHKNQKDDVAQTDLLQIDPNFGSKEDFDSLLQSAKKK
SIRVILDLTPNYRGENSWFSTQVDTVATKVKDALEFWLQAGVDGFQVRDIENLKDASSFL
AEWQNITKGFSEDRLLIAGTNSSDLQQILSLLESNKDLLLTSSYLSDSGSTGEHTKSLVT
QYLNATGNRWCSWSLSQARLLTSFLPAQLLRLYQLMLFTLPGTPVFSYGDEIGLDAAALP
GQPMEAPVMLWDESSFPDIPGAVSANMTVKGQSEDPGSLLSLFRRLSDQRSKERSLLHGD
FHAFSAGPGLFSYIRHWDQNERFLVVLNFGDVGLSAGLQASDLPASASLPAKADLLLSTQ
PGREEGSPLELERLKLEPHEGLLLRFPYAA
>sp|Q58CT8|LAX1_BOVIN Lymphocyte transmembrane adapter 1 OS=Bos taurus GN=LAX1 PE=2 SV=1
MDVTTSAWSETTRRISEPSTLQGTLGSLDKAEDHSSSIFSGFAALLAILLVVAVICVLWC
CGKRKKRQVPYLRVTIMPLLTLPRPRQRAKNIYDLLPRRQEELGRHPSRSIRIVSTESLL
SRNSDSPSSEHVPSRAGDALHMHRAHTHAMGYAVGIYDNAMRPQMCGNLAPSPHYVNVRA
SRGSPSTSSEDSRDYVNIPTAKEIAETLASASNPPRNLFILPGTKELAPSEEIDEGCGNA
SDCTSLGSPGTENSDPLSDGEGSSQTSNDYVNMAELDLGTPQGKQLQGMFQCRRDYENVP
PGPSSNKQQEEEVTSSNTDHVEGRTDGPETHTPPAVQSGSFLALKDHVACQSSAHSETGP
WEDAEETSSEDSHDYENVCAAEAGARG

The data I want to retrieve from the UniProt site is:

>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL

Each record corresponds to the single transmembrane domain located in that protein.

The XML data that lists the TM annotation is:

<feature type="transmembrane region" description="Helical; Signal-anchor for type II membrane protein;" status="potential"><location><begin position="185"/><end position="205"/></location></feature>

I might be able to parse this and then use the position data to save only the sequence data needed. Does Biopython have this parser yet?
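As a rough illustration of the parse-then-slice idea, here is a minimal sketch using only the Python standard library. The function names (`local_name`, `transmembrane_spans`, `extract`) are made up for this example, and it assumes the `begin`/`end` positions are one-based and inclusive, as the example data above suggests. Full UniProt XML files declare a namespace, so tags are compared by their local name.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://uniprot.org/uniprot}'."""
    return tag.rsplit("}", 1)[-1]


def transmembrane_spans(xml_text):
    """Yield (begin, end) for each 'transmembrane region' feature.

    Positions are returned as written in the XML (assumed one-based, inclusive).
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "feature":
            continue
        if elem.get("type") != "transmembrane region":
            continue
        begin = end = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "begin":
                begin = int(child.get("position"))
            elif name == "end":
                end = int(child.get("position"))
        if begin is not None and end is not None:
            yield begin, end


def extract(sequence, begin, end):
    """Slice a one-based inclusive range out of a zero-based Python string."""
    return sequence[begin - 1:end]


# The feature element quoted in the question (P08195)
fragment = (
    '<feature type="transmembrane region" '
    'description="Helical; Signal-anchor for type II membrane protein;" '
    'status="potential"><location><begin position="185"/>'
    '<end position="205"/></location></feature>'
)
print(list(transmembrane_spans(fragment)))  # [(185, 205)]
```

Calling `extract(full_sequence, 185, 205)` on the P08195 sequence should then give the 21-residue segment shown in the desired output.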

uniprot biopython • 3.9k views
ADD COMMENTlink modified 5.6 years ago by Peter5.8k • written 5.6 years ago by kevinjspring20

You can download the records as UniProt XML, or the old "SwissProt" plain text, and parse them locally to look for transmembrane domains & then extract the sequence for them. At least that's what I would try using Biopython.

Could you give a specific example (e.g. a UniProt protein ID where there are 3 transmembrane domains) and the desired output (e.g. a FASTA file with the region containing the three transmembrane domains only)?

ADD REPLYlink written 5.6 years ago by Peter5.8k

I am primarily interested in single-pass transmembrane proteins.

ADD REPLYlink written 5.6 years ago by kevinjspring20

I updated the question with some example data. Does Biopython have a parser for XML data from UniProt?

ADD REPLYlink written 5.6 years ago by kevinjspring20

Yes, "uniprot-xml" and "swiss" (plain text) are available in Biopython's Bio.SeqIO module, see http://biopython.org/wiki/SeqIO

ADD REPLYlink written 5.6 years ago by Peter5.8k
5.6 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

With the plain text SwissProt format, something like this Biopython script should work:

# Hard-coded list; could use os.listdir(...) or glob instead
filenames = ["O43561.txt", "P08195.txt", "Q58CT8.txt"]
input_format = "swiss"
feature_type = "TRANSMEM"
output_filename = "swiss_tm.fasta"

# Real code starts here...
from Bio import SeqIO

with open(output_filename, "w") as output:
    for filename in filenames:
        # SeqIO.parse copes with multi-record files
        for record in SeqIO.parse(filename, input_format):
            for f in record.features:
                if f.type == feature_type:
                    # Biopython locations are zero-based and end-exclusive, so
                    # add one to the start to get one-based UniProt numbering
                    title = "sp|%s|%i-%i" % (record.id, f.location.start + 1, f.location.end)
                    output.write(">%s\n%s\n" % (title, f.extract(record.seq)))

Or, using the UniProt XML format, change these lines:

filenames = ["O43561.xml", "P08195.xml", "Q58CT8.xml"]
input_format = "uniprot-xml"
feature_type = "transmembrane region"
output_filename = "uniprot_tm.fasta"

Either should give this as the FASTA format output:

>sp|O43561|5-27
ILVPCVLGLLLLPILAMLMALCV
>sp|P08195|185-205
LLLLFWLGWLGMLAGAVVIIV
>sp|Q58CT8|38-58
IFSGFAALLAILLVVAVICVL
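Since the question is mainly about single-pass proteins, one possible refinement is to keep only records with exactly one transmembrane feature. The sketch below is an illustration, not part of the script above: `tm_features` and `single_pass_only` are hypothetical helper names, and the code relies only on each record having a `.features` list whose items have a `.type` attribute (as Biopython `SeqRecord` objects do). The demo uses stand-in objects so it runs without any input files.

```python
from types import SimpleNamespace


def tm_features(record, feature_type="TRANSMEM"):
    """Return the features of the given type on a record."""
    return [f for f in record.features if f.type == feature_type]


def single_pass_only(records, feature_type="TRANSMEM"):
    """Yield only records annotated with exactly one transmembrane feature."""
    for record in records:
        if len(tm_features(record, feature_type)) == 1:
            yield record


# Stand-in records for demonstration only; real input would come from
# SeqIO.parse(filename, "swiss") or SeqIO.parse(filename, "uniprot-xml").
demo = [
    SimpleNamespace(id="single", features=[SimpleNamespace(type="TRANSMEM")]),
    SimpleNamespace(id="multi", features=[SimpleNamespace(type="TRANSMEM"),
                                          SimpleNamespace(type="TRANSMEM")]),
    SimpleNamespace(id="none", features=[SimpleNamespace(type="CHAIN")]),
]
print([r.id for r in single_pass_only(demo)])  # ['single']
```

For the UniProt XML format, pass `feature_type="transmembrane region"` instead of the default.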

Note I have not talked about how to automatically download the SwissProt/UniProt files, which would be a separate question.
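For completeness, here is a hedged sketch of one way the files might be fetched. It assumes UniProt serves individual entries at `https://rest.uniprot.org/uniprotkb/<accession>.txt` (or `.xml`); that URL pattern is an assumption on my part and may change, so check the current UniProt help pages before relying on it. The helper names `entry_url` and `download_entries` are invented for this example.

```python
import urllib.request

# Assumed base URL for per-entry downloads; verify against the UniProt docs.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, extension="txt"):
    """Build the (assumed) download URL for one accession."""
    return "%s/%s.%s" % (BASE_URL, accession, extension)


def download_entries(accessions, extension="txt"):
    """Save each entry as <accession>.<extension>; return the filenames."""
    filenames = []
    for accession in accessions:
        filename = "%s.%s" % (accession, extension)
        urllib.request.urlretrieve(entry_url(accession, extension), filename)
        filenames.append(filename)
    return filenames


# Example call (needs network access, so it is left commented out):
# filenames = download_entries(["O43561", "P08195", "Q58CT8"])
```

The returned list can be used directly as `filenames` in the script above; use `extension="xml"` together with the "uniprot-xml" settings.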

ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Peter5.8k
5.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

See my answer to "How to retrieve human proteins sequence containing a given domain":

java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI
ADD COMMENTlink written 5.6 years ago by Pierre Lindenbaum120k

I don't have any experience with Java. I will give it a try, but I was hoping there was something I could use with Biopython.

ADD REPLYlink written 5.6 years ago by kevinjspring20
5.6 years ago by
Geneva
Elisabeth Gasteiger1.6k wrote:

See also the UniProt FAQ: "How can I download the sequences corresponding to a specified domain or region from a list of UniProt entries?"

ADD COMMENTlink modified 5.5 years ago • written 5.6 years ago by Elisabeth Gasteiger1.6k

Please ask this as a new question, not as an attempted answer to the transmembrane parsing question.

ADD REPLYlink written 5.6 years ago by Peter5.8k