Question: How can I find proteins consisting of a specified number of amino acids?
Thalla wrote:

Hi,

I have no experience with biological databases and have no idea how to find proteins that use, for example, only four amino acids. I do research on genetic code evolution and just want to know which proteins could have been produced with only a small number of amino acids available. Ideally I would get all proteins that use four (or any other number of) amino acids; if that isn't possible, I would appreciate an explanation of how to get proteins that use a specified set of amino acids (for example Lys, Pro, Ala, Ile).

Thanks in advance.


There's probably no 'clever' way of doing this other than taking a dataset of proteins, calculating the amino acid composition of each one, and then filtering after the fact, but it's pretty brute force.

You might take a look at the answer in this (essentially the same) question:

Amino acid protein software

Comment written 10 weeks ago by jrj.healey
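
The brute-force approach described in the comment above (compute a composition per protein, filter afterwards) could be sketched roughly as follows. This is only an illustration: the function name distinct_counts and the file name proteins.fasta are made up for the example, and it assumes a plain FASTA stream on standard input with one-letter residue codes.

```shell
# Hypothetical helper: for every FASTA record on standard input, print
# "<number of distinct residues><TAB><header line>".
distinct_counts() {
  awk '
    # Count the distinct characters in seq and print the count with the header.
    function report(header, seq,    i, c, n, seen) {
      n = 0
      for (i = 1; i <= length(seq); i++) {
        c = substr(seq, i, 1)
        if (!(c in seen)) { seen[c] = 1; n++ }
      }
      printf("%d\t%s\n", n, header)
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { if (header != "") report(header, seq); header = $0; seq = ""; next }
    # Any other line is sequence data, possibly wrapped over several lines.
         { seq = seq $0 }
    END  { if (header != "") report(header, seq) }
  '
}

# Example (proteins.fasta is a placeholder for a local FASTA file):
#   distinct_counts < proteins.fasta | awk -F '\t' '$1 <= 4'
```

Writing the counts out first and filtering in a second step means the threshold can be changed without rescanning the sequences.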
Pierre Lindenbaum (Institut du Thorax, INSERM UMR1087, Nantes, France) wrote:

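Get the proteins from UniProt and linearize the FASTA with awk (one record per line, header and sequence separated by a tab). Then use a second awk to count the distinct amino acids in each sequence; the command below keeps sequences that use at most four.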

$ wget -q  -O - "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" |\
gunzip  -c  |\
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\
awk -F '\t'  '{delete a;for(i=1;i<= length($2);i++) a[substr($2,i,1)]=1; if(length(a)<=4) printf("%s\n%s\n",$1,$2);}'

>sp|P35904|ACH1_ACHFU Achatin-1 OS=Achatina fulica OX=6530 PE=1 SV=1
GFAD
>sp|P84761|ACI_MACGN Angiotensin-1-converting enzyme inhibitory peptide OS=Macrocybe gigantea OX=1491104 PE=1 SV=1
GEP
>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki OX=8213 PE=1 SV=1
AATAATAATAATAATAATAATAATAATAATA
>sp|P11920|ANP7_ELEGR Ice-structuring glycoprotein 7R OS=Eleginus gracilis OX=8047 PE=1 SV=1
AATAATPATAATPATAARA
>sp|P11921|ANP8_ELEGR Ice-structuring glycoprotein 8R OS=Eleginus gracilis OX=8047 PE=1 SV=1
AATAATPATAATPARA
>sp|P0CU56|ANT_AMAPH Antamanide OS=Amanita phalloides OX=67723 PE=1 SV=1
FFVPPAFFPP
>sp|P84182|AP21_EISFE Antimicrobial peptide OEP3121 OS=Eisenia fetida OX=6396 PE=1 SV=1
ACSAG
>sp|P84071|ASCL_ALLCG Ascalin (Fragment) OS=Allium cepa var. aggregatum OX=28911 PE=1 SV=1
YQCGQGG
>sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens OX=9606 GN=ATXN8 PE=1 SV=1
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>sp|P13071|BIOA_CITFR Adenosylmethionine-8-amino-7-oxononanoate aminotransferase (Fragment) OS=Citrobacter freundii OX=546 GN=bioA PE=3 SV=1
MTTDD
>sp|P12997|BIOB_CITFR Biotin synthase (Fragment) OS=Citrobacter freundii OX=546 GN=bioB PE=3 SV=1
MAHSS
>sp|P86723|BTDB_PAPHA Theta defensin subunit B OS=Papio hamadryas OX=9557 PE=1 SV=1
RCVCRRGVC
>sp|P20104|CCF1_ENTFL Sex pheromone cCF10 OS=Enterococcus faecalis OX=1351 PE=1 SV=1
LVTLVFV
>sp|P62567|CDN11_LITCH Caeridin-1.1/1.2/1.3 OS=Litoria chloris OX=86064 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62566|CDN11_LITGI Caeridin-1.1/1.2/1.3 OS=Litoria gilleni OX=39405 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62565|CDN11_LITSP Caeridin-1.1/1.2/1.3 OS=Litoria splendida OX=30345 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62564|CDN11_LITXA Caeridin-1.1/1.2/1.3 OS=Litoria xanthomera OX=79697 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62581|CDN14_LITCH Caeridin-1.4 OS=Litoria chloris OX=86064 PE=1 SV=1
GLLDGLLGGLGL
>sp|P62582|CDN14_LITXA Caeridin-1.4 OS=Litoria xanthomera OX=79697 PE=1 SV=1
GLLDGLLGGLGL
>sp|P86977|CHIT_STRVO Chitinase (Fragment) OS=Streptomyces violaceusniger OX=68280 PE=1 SV=1
GDGTGPGPGP
>sp|P86168|COLUA_COLDE Colutellin-A (Fragments) OS=Colletotrichum dematium OX=34405 PE=1 SV=1
VISIIPV
>sp|P11735|CU30_LOCMI Cuticle protein 30 (Fragment) OS=Locusta migratoria OX=7004 PE=1 SV=1
GLLGLGYGGY
>sp|P80831|CWP07_ARATH 34 kDa cell wall protein (Fragment) OS=Arabidopsis thaliana OX=3702 PE=1 SV=1
EQDRR
>sp|P19916|DCML_PSECH Carbon monoxide dehydrogenase large chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutL PE=1 SV=1
MGHP
>sp|P19918|DCMS_PSECH Carbon monoxide dehydrogenase small chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutS PE=1 SV=1
MAKA
>sp|P82079|DYS1_LIMIN Dynastin-1 OS=Limnodynastes interioris OX=30362 PE=1 SV=1
GLLSGLGL
>sp|P82080|DYS2_LIMDU Dynastin-2 OS=Limnodynastes dumerilii OX=104065 PE=1 SV=1
GLLSSLGLNL
(...)
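
For the second part of the question (proteins built from a specified set such as Lys, Pro, Ala, Ile), a variant of the same idea might look like the sketch below. The function name only_residues is hypothetical. It assumes the allowed set is passed as plain one-letter codes (K, P, A, I here) and contains no characters that are special inside a bracket expression. Note that it keeps sequences drawn from the set, including ones that use only some of the listed residues.

```shell
# Hypothetical helper: keep FASTA records whose sequence contains only the
# one-letter residue codes given in $1 (for example KPAI).
only_residues() {
  awk -v allowed="$1" '
    # Print the record if the whole sequence matches ^[allowed]+$.
    function report(header, seq) {
      if (seq ~ ("^[" allowed "]+$")) printf("%s\n%s\n", header, seq)
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { if (header != "") report(header, seq); header = $0; seq = ""; next }
    # Any other line is sequence data, possibly wrapped over several lines.
         { seq = seq $0 }
    END  { if (header != "") report(header, seq) }
  '
}

# Example, assuming a local copy of the Swiss-Prot file used above:
#   gunzip -c uniprot_sprot.fasta.gz | only_residues KPAI
```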