Question

How can I find proteins consisting of a specified number of amino acids?

5.2 years ago

Thalla • 0

Hi,

I have no experience with biological databases and have no idea how to find proteins that use, for example, only four amino acids. I do research on genetic code evolution and want to know which proteins could have been produced from only a small number of amino acids. Ideally I would get all proteins that use four (or any other number of) amino acids; if that isn't possible, I would appreciate an explanation of how to get proteins that use a specified set of amino acids (for example Lys, Pro, Ala, Ile).

Thanks in advance.

proteins amino acids • 1.2k views

ADD COMMENT • link updated 5.2 years ago by Pierre Lindenbaum 161k • written 5.2 years ago by Thalla • 0

There's probably no 'clever' way of doing this other than taking a dataset of proteins, calculating the amino acid composition of each one, and filtering after the fact, but it's pretty brute force.

You might take a look at the answer in this (essentially the same) question:

Amino acid protein software

ADD REPLY • link 5.2 years ago by Joe 21k

score 1 · Answer 1 · 2019-02-06

Get the proteins from UniProt and linearize the FASTA using awk. Use a second awk to count the distinct amino acids in each sequence and keep the sequences that use at most four.

$ wget -q  -O - "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" |\
gunzip  -c  |\
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\
awk -F '\t'  '{delete a;for(i=1;i<= length($2);i++) a[substr($2,i,1)]=1; if(length(a)<=4) printf("%s\n%s\n",$1,$2);}'

>sp|P35904|ACH1_ACHFU Achatin-1 OS=Achatina fulica OX=6530 PE=1 SV=1
GFAD
>sp|P84761|ACI_MACGN Angiotensin-1-converting enzyme inhibitory peptide OS=Macrocybe gigantea OX=1491104 PE=1 SV=1
GEP
>sp|P02732|ANP3_PAGBO Ice-structuring glycoprotein 3 (Fragments) OS=Pagothenia borchgrevinki OX=8213 PE=1 SV=1
AATAATAATAATAATAATAATAATAATAATA
>sp|P11920|ANP7_ELEGR Ice-structuring glycoprotein 7R OS=Eleginus gracilis OX=8047 PE=1 SV=1
AATAATPATAATPATAARA
>sp|P11921|ANP8_ELEGR Ice-structuring glycoprotein 8R OS=Eleginus gracilis OX=8047 PE=1 SV=1
AATAATPATAATPARA
>sp|P0CU56|ANT_AMAPH Antamanide OS=Amanita phalloides OX=67723 PE=1 SV=1
FFVPPAFFPP
>sp|P84182|AP21_EISFE Antimicrobial peptide OEP3121 OS=Eisenia fetida OX=6396 PE=1 SV=1
ACSAG
>sp|P84071|ASCL_ALLCG Ascalin (Fragment) OS=Allium cepa var. aggregatum OX=28911 PE=1 SV=1
YQCGQGG
>sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens OX=9606 GN=ATXN8 PE=1 SV=1
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>sp|P13071|BIOA_CITFR Adenosylmethionine-8-amino-7-oxononanoate aminotransferase (Fragment) OS=Citrobacter freundii OX=546 GN=bioA PE=3 SV=1
MTTDD
>sp|P12997|BIOB_CITFR Biotin synthase (Fragment) OS=Citrobacter freundii OX=546 GN=bioB PE=3 SV=1
MAHSS
>sp|P86723|BTDB_PAPHA Theta defensin subunit B OS=Papio hamadryas OX=9557 PE=1 SV=1
RCVCRRGVC
>sp|P20104|CCF1_ENTFL Sex pheromone cCF10 OS=Enterococcus faecalis OX=1351 PE=1 SV=1
LVTLVFV
>sp|P62567|CDN11_LITCH Caeridin-1.1/1.2/1.3 OS=Litoria chloris OX=86064 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62566|CDN11_LITGI Caeridin-1.1/1.2/1.3 OS=Litoria gilleni OX=39405 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62565|CDN11_LITSP Caeridin-1.1/1.2/1.3 OS=Litoria splendida OX=30345 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62564|CDN11_LITXA Caeridin-1.1/1.2/1.3 OS=Litoria xanthomera OX=79697 PE=1 SV=1
GLLDGLLGTLGL
>sp|P62581|CDN14_LITCH Caeridin-1.4 OS=Litoria chloris OX=86064 PE=1 SV=1
GLLDGLLGGLGL
>sp|P62582|CDN14_LITXA Caeridin-1.4 OS=Litoria xanthomera OX=79697 PE=1 SV=1
GLLDGLLGGLGL
>sp|P86977|CHIT_STRVO Chitinase (Fragment) OS=Streptomyces violaceusniger OX=68280 PE=1 SV=1
GDGTGPGPGP
>sp|P86168|COLUA_COLDE Colutellin-A (Fragments) OS=Colletotrichum dematium OX=34405 PE=1 SV=1
VISIIPV
>sp|P11735|CU30_LOCMI Cuticle protein 30 (Fragment) OS=Locusta migratoria OX=7004 PE=1 SV=1
GLLGLGYGGY
>sp|P80831|CWP07_ARATH 34 kDa cell wall protein (Fragment) OS=Arabidopsis thaliana OX=3702 PE=1 SV=1
EQDRR
>sp|P19916|DCML_PSECH Carbon monoxide dehydrogenase large chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutL PE=1 SV=1
MGHP
>sp|P19918|DCMS_PSECH Carbon monoxide dehydrogenase small chain (Fragment) OS=Pseudomonas carboxydohydrogena OX=290 GN=cutS PE=1 SV=1
MAKA
>sp|P82079|DYS1_LIMIN Dynastin-1 OS=Limnodynastes interioris OX=30362 PE=1 SV=1
GLLSGLGL
>sp|P82080|DYS2_LIMDU Dynastin-2 OS=Limnodynastes dumerilii OX=104065 PE=1 SV=1
GLLSSLGLNL
(...)
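
The question also asks about a specified set of amino acids (for example Lys, Pro, Ala, Ile). A minimal sketch of one way to adapt the same two-step awk approach: the function name `filter_by_alphabet` is made up for illustration, and it assumes FASTA on standard input and one-letter residue codes (K, P, A, I for the example set).

```shell
# Hypothetical helper: keep only the FASTA records whose sequence is built
# entirely from the one-letter codes given in $1 (e.g. KPAI).
filter_by_alphabet() {
  # Step 1: linearize the FASTA so each record is "header<TAB>sequence".
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |
  # Step 2: keep records whose sequence matches ^[allowed]+$ and print them as FASTA again.
  awk -F '\t' -v allowed="$1" '$2 ~ ("^[" allowed "]+$") {printf("%s\n%s\n",$1,$2);}'
}

# Example call (assumes uniprot_sprot.fasta.gz was downloaded beforehand):
#   gunzip -c uniprot_sprot.fasta.gz | filter_by_alphabet KPAI
```

Sequences that use only a subset of the given letters (for example only K and A) also pass this filter; to require that every listed amino acid is present, an extra check per letter would be needed.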