How To Retrieve Human Protein Sequences Containing A Given Domain
11.6 years ago

I would like to know the best way to retrieve the human protein sequences that contain a given domain (e.g. FYVE). Thanks in advance for sharing your approach(es).

11.6 years ago

If the domain has a "type" or a "definition" defined in UniProt (I don't know whether these come from a controlled vocabulary), here is my Java solution.

Compilation:

xjc "http://www.uniprot.org/support/docs/uniprot.xsd"
javac Biostar5862.java org/uniprot/uniprot/*.java

Test with type="transmembrane region"

java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI

Test with d="FYVE-type"

java Biostar5862 -d "FYVE-type" | head -n 20
>[FGD1_MOUSE]|729-789
EKEVTMCMRCQEPFNSITKRRHHCKACGHVVCGKCSEFRARLIYDNNRSNRVCTDCYVALH
>[LST2_DROMO]|965-1025
DGKAPRCMSCQTPFTAFRRRHHCRNCGGVFCGVCSNASAPLPKYGLTKAVRVCRECYVREV
>[RFFL_HUMAN]|41-96
TGLEPSCKSCGAHFANTARKQTCLDCKKNFCMTCSSQVGNGPRLCLLCQRFRATAF
>[RNF34_BOVIN]|56-107
EGPNIVCKACGLSFSVFRKKHVCCDCKKDFCSVCSVLQENLRRCSTCHLLQE
>[RUFY1_HUMAN]|642-700
DDEATHCRQCEKEFSISRRKHHCRNCGHIFCNTCSSNELALPSYPKPVRVCDSCHTLLL
>[SYTL4_MOUSE]|63-105
CARCQEGLGRLIPKSSTCVGCNHLVCRECRVLESNGSWRCKVC
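
Since the question asks for human proteins only, the output above could be narrowed to entries whose name ends in _HUMAN. The following is a minimal Python sketch, not part of the original Java program. It assumes the header layout shown above (`>[ENTRY_NAME]|start-end`) and one sequence line per record; the function name `human_records` is made up for illustration.

```python
# Three records copied from the FYVE-type output above.
SAMPLE = """\
>[RFFL_HUMAN]|41-96
TGLEPSCKSCGAHFANTARKQTCLDCKKNFCMTCSSQVGNGPRLCLLCQRFRATAF
>[RNF34_BOVIN]|56-107
EGPNIVCKACGLSFSVFRKKHVCCDCKKDFCSVCSVLQENLRRCSTCHLLQE
>[RUFY1_HUMAN]|642-700
DDEATHCRQCEKEFSISRRKHHCRNCGHIFCNTCSSNELALPSYPKPVRVCDSCHTLLL
"""


def human_records(lines):
    """Yield (header, sequence) pairs whose entry name ends in _HUMAN.

    Assumes each header is followed by exactly one sequence line.
    """
    header = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            header = line
        elif line and header is not None:
            # ">[RUFY1_HUMAN]|642-700" -> "RUFY1_HUMAN"
            entry_name = header[1:].split("|")[0].strip("[]")
            if entry_name.endswith("_HUMAN"):
                yield header, line
            header = None


records = list(human_records(SAMPLE.splitlines()))
for header, seq in records:
    print(header)
    print(seq)
```

Filtering on the entry-name suffix is only a shortcut; checking the NCBI taxonomy identifier in the XML is the more reliable test.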

Thanks master Pierre. I would be curious to see the result if you filter on "FYVE" in the section "Sequence similarities: Contains 1 FYVE-type zinc finger" and on "9606" in the section "Taxonomic identifier: 9606 [NCBI]", but I don't know where these are located in the XML file. Here is the web entry I used: http://www.uniprot.org/uniprot/Q96K21

uniprot.org/uniprot/Q96K21.xml gives you the answer for the taxonomy: the path is uniprot/entry/organism/dbReference[@type="NCBI Taxonomy" and @id="9606"]. See the generated classes to find out how to get this object.
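
As a rough illustration of combining the taxonomy test with the domain filter, here is a minimal Python sketch using only the standard library. It is not the Java program from the answer. The toy XML, its entry names and its sequences are invented for the example, and it assumes that the taxonomy `dbReference` sits inside the `organism` element and that the domain appears as a `feature` whose `description` contains the keyword.

```python
import xml.etree.ElementTree as ET

# Namespace used by UniProt XML files.
NS = {"u": "http://uniprot.org/uniprot"}

# Toy document: entry names and sequences are invented for illustration.
TOY_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>TOY1_HUMAN</name>
    <organism>
      <dbReference type="NCBI Taxonomy" id="9606"/>
    </organism>
    <feature type="zinc finger region" description="FYVE-type">
      <location><begin position="3"/><end position="6"/></location>
    </feature>
    <sequence>MACDEFGHIK</sequence>
  </entry>
  <entry>
    <name>TOY2_MOUSE</name>
    <organism>
      <dbReference type="NCBI Taxonomy" id="10090"/>
    </organism>
    <feature type="zinc finger region" description="FYVE-type">
      <location><begin position="2"/><end position="4"/></location>
    </feature>
    <sequence>MKLVNPQRST</sequence>
  </entry>
</uniprot>"""


def domain_regions(xml_text, taxon_id="9606", keyword="FYVE"):
    """Return (entry name, start, end, subsequence) for matching features."""
    root = ET.fromstring(xml_text)
    tax_path = (
        "u:organism/u:dbReference[@type='NCBI Taxonomy']"
        f"[@id='{taxon_id}']"
    )
    hits = []
    for entry in root.findall("u:entry", NS):
        if entry.find(tax_path, NS) is None:
            continue  # not the requested organism
        name = entry.findtext("u:name", namespaces=NS)
        raw_seq = entry.findtext("u:sequence", default="", namespaces=NS)
        seq = "".join(raw_seq.split())  # drop any line breaks
        for feat in entry.findall("u:feature", NS):
            if keyword not in feat.get("description", ""):
                continue
            begin = feat.find("u:location/u:begin", NS)
            end = feat.find("u:location/u:end", NS)
            if begin is None or end is None:
                continue  # feature without explicit boundaries
            start = int(begin.get("position"))
            stop = int(end.get("position"))
            # Positions are 1-based and inclusive.
            hits.append((name, start, stop, seq[start - 1:stop]))
    return hits


hits = domain_regions(TOY_XML)
for name, start, stop, subseq in hits:
    print(f">[{name}]|{start}-{stop}")
    print(subseq)
```

For a full UniProt dump, `ET.iterparse` would be a better fit than loading the whole file with `ET.fromstring`.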

11.6 years ago

Here is my approach to find "MYDOMAIN":

1) Go to http://smart.embl.de/ in Genomic mode (this mode should avoid redundancy).

2) In the "Domains detected by SMART" section, type "MYDOMAIN" in the keywords text box and click "Search for keywords".

3) In the card of your domain of interest, click on "Evolution (species in which this domain is found)".

4) Then click on the "Homo sapiens" shortcut to get to the human node.

5) Click on the Homo sapiens node to access the "Proteins in Homo sapiens with MYDOMAIN domain" card.

6) From this page you have access to all the protein sequences related to your domain of interest in FASTA format.

11.6 years ago

You can use the BioMart web interface in Ensembl. There is a specific filter for genes with a given domain, and you can use a broad range of cross-reference identifiers (Pfam, InterPro, etc.). I really like this approach because it lets you relate a domain to gene structure, get sequence variation, and do a lot of other very useful things. Of course, you can't obtain all kinds of raw data (sequences, structures, etc.).

Have you ever tried it?

Thanks a lot for the suggestion. Unfortunately, it seems that I cannot use SMART IDs to filter the gene set.

I've checked it again. SMART IDs are there too. Go to Filters -> Protein Domains -> Limit to genes -> with Protein feature smart IDs.

I agree, but you cannot enter a specific SMART ID to filter on. You can only do that with the select box just below, which does not contain a SMART option.

That's true! Besides that, Ensembl will return only SMART accessions (erroneously called IDs). But there's a workaround! The BioMart web interface also generates a Perl script. You just need to add a few lines.
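
The "few lines" are not shown in the thread. One way to picture the idea is a post-filter on the tab-separated results, sketched here in Python rather than Perl. The column names, the gene identifiers (`GENE_A`, etc.) and the function name are placeholders, and `SM00064` is used on the assumption that it is the SMART accession for FYVE; check it against SMART before relying on it.

```python
import csv
import io

# Toy export: column names and gene identifiers are placeholders,
# not real BioMart output.
TOY_TSV = (
    "Gene stable ID\tSMART ID\n"
    "GENE_A\tSM00064\n"
    "GENE_B\tSM00233\n"
    "GENE_C\tSM00064\n"
    "GENE_A\tSM00233\n"
)


def genes_with_accession(tsv_text, accession,
                         gene_col="Gene stable ID",
                         domain_col="SMART ID"):
    """Return gene IDs (in first-seen order) annotated with `accession`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    genes = []
    for row in reader:
        if row[domain_col] == accession and row[gene_col] not in genes:
            genes.append(row[gene_col])
    return genes


# SM00064 is assumed to be the SMART accession for the FYVE domain.
fyve_genes = genes_with_accession(TOY_TSV, "SM00064")
print(fyve_genes)
```

The same filter can be applied to a file saved from the BioMart results page by passing its contents as `tsv_text` and adjusting the two column names to match the export.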

11.6 years ago

There are already nice solutions here. If it is for one or two protein domain families, you can get the list of all occurrences of the domain in an organism using the Species tab (Species distribution) in Pfam. Tick the check-box next to your organism of interest, then click on Download to download a text file with sequence accessions or sequences in FASTA format. Pfam also provides a list of domain architectures with FYVE in human. Here is the link to access the architectures of 74 sequences with the FYVE domain.

Thanks Kadher. I tested your approach and it works well. If I restrict the results to the genes that encode the proteins, I get around 40 genes, which match the genes from Marina's method.

11.6 years ago
Marina Manrique

Another way to get them is to use the advanced search in UniProt. First select the "Domain" option in the field box and type "FYVE". Then click "Add & Search", select "Organism" in the field box and type/select "Homo sapiens".

The other day I tried to upload images to the post and couldn't get it to work, so I've written a short post about this here: http://blog.ohnosequences.com/?p=136

Obviously this is not an approach for performing searches programmatically, but by combining the options available in this advanced search interface you can perform quite complex searches.
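
For readers who do want a scripted version of this search, a similar query can probably be sent to UniProt's REST service. The sketch below only builds the request URL and does not contact the server. The endpoint `https://rest.uniprot.org/uniprotkb/search` postdates this thread, and the query field names `ft_zn_fing` (FYVE-type is annotated as a zinc finger feature in the entries shown earlier) and `organism_id` are assumptions about the current query syntax that should be checked against the UniProt documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint of the current UniProt REST service.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(keyword, taxon_id=9606, fmt="fasta"):
    """Build a search URL for proteins of one organism with a given feature.

    The field names ft_zn_fing and organism_id are assumptions;
    adjust them if the service reports an invalid query.
    """
    query = f"(ft_zn_fing:{keyword}) AND (organism_id:{taxon_id})"
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt})


url = build_search_url("FYVE")
print(url)
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen`; large result sets may be paginated by the service.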

Your solution works quite well and returns 40 human proteins instead of 28 for mine. The additional proteins do not seem to be false positives, so that is nice.

I was talking more about human genes coding for proteins with this domain.
