How To Retrieve Human Protein Sequences Containing A Given Domain
11.6 years ago

I would like to know the best way to retrieve the human protein sequences that contain a given domain (e.g. FYVE). Thanks in advance for sharing your approach(es).

11.6 years ago

If you have a "type" or a "definition" defined in UniProt (I don't know if it is a controlled vocabulary), here is my Java solution.

Compilation:

xjc "http://www.uniprot.org/support/docs/uniprot.xsd"
javac Biostar5862.java org/uniprot/uniprot/*.java


### Test with type="transmembrane region"

java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI


### Test with d="FYVE-type"

java Biostar5862 -d "FYVE-type" | head -n 20
>[FGD1_MOUSE]|729-789
EKEVTMCMRCQEPFNSITKRRHHCKACGHVVCGKCSEFRARLIYDNNRSNRVCTDCYVALH
>[LST2_DROMO]|965-1025
DGKAPRCMSCQTPFTAFRRRHHCRNCGGVFCGVCSNASAPLPKYGLTKAVRVCRECYVREV
>[RFFL_HUMAN]|41-96
TGLEPSCKSCGAHFANTARKQTCLDCKKNFCMTCSSQVGNGPRLCLLCQRFRATAF
>[RNF34_BOVIN]|56-107
EGPNIVCKACGLSFSVFRKKHVCCDCKKDFCSVCSVLQENLRRCSTCHLLQE
>[RUFY1_HUMAN]|642-700
DDEATHCRQCEKEFSISRRKHHCRNCGHIFCNTCSSNELALPSYPKPVRVCDSCHTLLL
>[SYTL4_MOUSE]|63-105
CARCQEGLGRLIPKSSTCVGCNHLVCRECRVLESNGSWRCKVC
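
The output above mixes species, while the question asks for human proteins only. As a rough sketch (not part of the original Java program), the FASTA text could be post-filtered on the species mnemonic at the end of the UniProt entry name. The function names `parse_fasta` and `keep_species` are hypothetical, and the sample records are copied from the output above.

```python
# Three records copied from the FYVE-type output above.
SAMPLE = """\
>[FGD1_MOUSE]|729-789
EKEVTMCMRCQEPFNSITKRRHHCKACGHVVCGKCSEFRARLIYDNNRSNRVCTDCYVALH
>[RFFL_HUMAN]|41-96
TGLEPSCKSCGAHFANTARKQTCLDCKKNFCMTCSSQVGNGPRLCLLCQRFRATAF
>[RUFY1_HUMAN]|642-700
DDEATHCRQCEKEFSISRRKHHCRNCGHIFCNTCSSNELALPSYPKPVRVCDSCHTLLL
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_species(records, mnemonic="HUMAN"):
    """Keep records whose bracketed entry name ends in _<mnemonic>."""
    suffix = f"_{mnemonic}]"
    return [(h, s) for h, s in records if h.split("|")[0].endswith(suffix)]


if __name__ == "__main__":
    for header, seq in keep_species(parse_fasta(SAMPLE)):
        print(f">{header}\n{seq}")
```

This relies only on the header layout shown above (`>[NAME_SPECIES]|start-end`); filtering on the taxonomy identifier inside the XML would be more robust than matching the name suffix.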


Thanks master Pierre. I would be curious to see the result if you filter on "FYVE" in the section "Sequence similarities: Contains 1 FYVE-type zinc finger" and on "9606" in the section "Taxonomic identifier: 9606 [NCBI]". But I don't know where they are in the XML file. Here is the web entry I used: http://www.uniprot.org/uniprot/Q96K21


uniprot.org/uniprot/Q96K21.xml gives you the answer for the taxonomy: the path is uniprot/entry/organism/dbReference[@type="NCBI Taxonomy" and @id="9606"]. See the generated classes to find out how to get this object.
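
To make the taxonomy filter concrete, here is a minimal Python sketch (not the Java/JAXB code discussed above) that keeps only entries carrying a `dbReference` of type "NCBI Taxonomy" with id 9606 and cuts out the annotated region. The sample XML is a made-up, heavily simplified stand-in: real UniProt files declare the namespace `http://uniprot.org/uniprot` and contain far more elements. The entry names, sequences, and the function `human_domain_regions` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified UniProt-style XML (namespace omitted on purpose).
SAMPLE_XML = """
<uniprot>
  <entry>
    <name>DEMO1_HUMAN</name>
    <organism>
      <dbReference type="NCBI Taxonomy" id="9606"/>
    </organism>
    <feature type="zinc finger region" description="FYVE-type">
      <location><begin position="3"/><end position="8"/></location>
    </feature>
    <sequence>MACDEFGHIKLMNPQ</sequence>
  </entry>
  <entry>
    <name>DEMO2_MOUSE</name>
    <organism>
      <dbReference type="NCBI Taxonomy" id="10090"/>
    </organism>
    <feature type="zinc finger region" description="FYVE-type">
      <location><begin position="2"/><end position="5"/></location>
    </feature>
    <sequence>MKLVCCRRTT</sequence>
  </entry>
</uniprot>
"""


def human_domain_regions(xml_text, description, taxon_id="9606"):
    """Return (header, subsequence) pairs for matching features."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("entry"):
        # ElementTree's XPath subset has no "and", so predicates are chained.
        # The descendant search finds the reference wherever it sits in the entry.
        tax = entry.find(
            f".//dbReference[@type='NCBI Taxonomy'][@id='{taxon_id}']"
        )
        if tax is None:
            continue
        name = entry.findtext("name")
        sequence = "".join(entry.findtext("sequence").split())
        for feature in entry.iter("feature"):
            if description not in feature.get("description", ""):
                continue
            begin = int(feature.find("location/begin").get("position"))
            end = int(feature.find("location/end").get("position"))
            # Positions are 1-based and inclusive.
            records.append((f"{name}|{begin}-{end}", sequence[begin - 1:end]))
    return records


if __name__ == "__main__":
    for header, region in human_domain_regions(SAMPLE_XML, "FYVE"):
        print(f">{header}\n{region}")
```

With real files the element names would need the namespace prefix, and features without a simple begin/end location would have to be skipped.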

11.6 years ago

Here is my approach to find "MYDOMAIN":

1) Go to http://smart.embl.de/ in Genomic Mode (this mode should avoid redundancy).

2) In the "Domains detected by SMART" section, type "MYDOMAIN" in the keywords text box and click "Search for keywords".

3) In the card of your domain of interest, click on "Evolution (species in which this domain is found)".

4) Then click on the "Homo sapiens" shortcut to get to the human node.

5) Click on the Homo sapiens node to get access to the "Proteins in Homo sapiens with MYDOMAIN domain" card.

6) From this page you have access to all the protein sequences related to your domain of interest in FASTA format.

11.6 years ago

You can use the BioMart web interface in Ensembl. There's a specific filter for genes with a given domain, and you can use a broad range of cross-reference identifiers (Pfam, InterPro, etc.). I really like this approach because it lets you relate domains to gene structure, get sequence variation, and do a lot of other very useful things. Of course, you can't obtain all kinds of raw data (sequences, structures, etc.).

Have you ever tried it?


Thanks a lot for the suggestion. Unfortunately, it seems that I cannot use SMART IDs to filter the gene set.


I've checked it again. SMART IDs are there too. Go to Filters -> Protein Domains -> Limit to genes -> with Protein feature smart IDs.


I agree, but you cannot enter a specific SMART ID to filter on. You can only do that with the select box just below, which does not contain a SMART option.


That's true! Besides that, Ensembl will return only SMART accession numbers (erroneously called IDs). But there's a workaround! The BioMart web interface also generates a Perl script. You just need to add a few lines.

11.6 years ago

There are already nice solutions here. If it is for one or two protein domain families, you can get the list of all domains in an organism using the Species tab (Species distribution) in Pfam. Tick the check-box next to your organism of interest, then click Download to get a text file with sequence accessions or the sequences in FASTA format. Pfam also provides a list of the domain architectures containing FYVE in human. Here is the link to access the architectures of the 74 sequences with a FYVE domain.


Thanks Kadher. I tested your approach and it works well. If I restrict the results to the genes that encode the proteins, I get around 40 genes, which match the genes from Marina's method.

11.6 years ago
Marina Manrique

Another way to get them is to use the advanced search in UniProt. First select the "Domain" option in the field box and type "FYVE". Then click "Add & Search", select "Organism" in the field box and type/select "Homo sapiens".

The other day I tried to upload images to this post and couldn't, so I've written a short blog post about it here: http://blog.ohnosequences.com/?p=136

Obviously this is not a programmatic approach, but by combining the options available in the advanced search interface you can perform quite complex searches.
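
For readers who do want a scripted version of this kind of search, the sketch below only builds a query URL and does not contact any server. It is not taken from the post above. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the query fields `ft_zn_fing` and `organism_id` are assumptions about UniProt's current REST service and should be checked against its documentation before use. The function name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: UniProt's current REST search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(domain_keyword, taxon_id=9606, fmt="fasta"):
    """Build a search URL for a zinc-finger keyword within one organism.

    The field names ft_zn_fing and organism_id are assumptions; swap in
    whichever feature field matches the annotation you are after.
    """
    query = f"(ft_zn_fing:{domain_keyword}) AND (organism_id:{taxon_id})"
    return f"{UNIPROT_SEARCH}?{urlencode({'query': query, 'format': fmt})}"


if __name__ == "__main__":
    print(build_search_url("FYVE"))
```

The resulting URL can be pasted into a browser or fetched with any HTTP client; large result sets may be paginated, which this sketch does not handle.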


Your solution works well and returns 40 human proteins instead of 28 for mine. The additional proteins do not seem to be false positives, so that is cool.


I was talking more about the human genes coding for proteins with this domain.