How to find a protein with a specific number of aminoacids
1
Hi all,
Is there any way to find a protein sequence with exactly a certain number of amino acids, for example only two Cysteins?
I will appreciate any advice
sequence
• 1.7k views
Two cysteins, here we go:
curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" |\
gunzip -c |\
awk '/^>/ {printf("\n%s \t",$0); next;} {printf("%s",$0);}' |\
awk -F '\t' '{S=$2;gsub(/[^Cc]/,"",S);if(length(S)==2) printf("%s\n%s\n",$1,$2);}'
>sp|Q91G67|029R_IIV6 Uncharacterized protein 029R OS=Invertebrate iridescent virus 6 GN=IIV6-029R PE=4 SV=1
MVERLGIAVEDRSPKLRKQAIRERFVLFKKNTERVEKYEYYAIRGQSIYINGRLSKLQSERYPKMIILLDIFCQPNPRNLFLRFKERIDGKSEWENNFTYAGNNIGCTKEMESDMIRIFNELDDEKRDV
>sp|Q6GZU4|032R_FRG3G Uncharacterized protein 032R OS=Frog virus 3 (isolate Goorha) GN=FV3-032R PE=4 SV=1
MVTVTELRATAKNLGIRGYSTMRKAELEEAIRDHGRVSEARVASPRRSPARSPRKSPAGRKSPSKSPAGRKSPSKSPAGRKSPSKSPAGRKSPSKSPAGRKSPSKSPAGRKSPSKSPVRKSPSKSPVRKSPRKSPAAKLQAGDRPASMNICKNLPKQRLVDIATEMGIDLNRESDGKPKTKDQLCADIMGGAGRKSPRKSPSRSPVRKSPSRSPVRKSPVRSPRKSPVRVPSPVRSPVKEKTPVRSPARSEDAGSDLAPRPRRGKAVRLDYDEDDDYSYGASTDNLFSGNKEIPFPTRKRRTRKPEKVFVDVRSPHTLTDSEDEDDMVEVPELEDKEITMPGVLSPYSDEIVERGYVSQGGADYINYIYRTEYALESDESFARGARPKTNKRDSDRAVREAAAAAAIARALDRRSQSGNDEPAVRRRSAPTDSSRESRRDREPQRDIAEPQRDIAEPQRDIAEPQRDIAEPRKVRFREAGSADVRVFERDEPKEYGRVPVRPPLFMPAGEPLQPLKFRPKTPKIDDTIHRAQMVLPSKPSQKETDNYYKQFAGEAVRPSEPVQWDKDDQVLYHKVPAWDDSSYAAAVSAWPMSVDPKQAESVFAEFEQLSAQDSDLIKVRKSIMKALGY
>sp|Q197C3|037L_IIV3 Uncharacterized protein 037L OS=Invertebrate iridescent virus 3 GN=IIV3-037L PE=4 SV=1
MNAATSGIQLNAQTLSQQPAMNTPLIHRSFRDDYTGLVSAGDGLYKRKLKVPSTTRCNKFKWCSIGWSIGALIIFLVYKLEKPHVQPTSNGNLSLIEPEKLVSESQLIQKILNATTPQTTTPEIPSSTEPQELVTEILNTTTPQTTTPEIPSSTEPQELVTEIPSSTEPQEEIFSIFKSPKPEEPGGINSIPQYEQESNNVEDEPPPNKPEEEEDHDNQPLEERHTVPILGDVIIRNKTIIIDGGNETIIIKP
>sp|Q6GZT9|037R_FRG3G uncharacterized protein 037R OS=Frog virus 3 (isolate Goorha) GN=FV3-037R PE=3 SV=1
MQVFLDLDETLIHSIPVSRLGWTKSKPYPVKPFTVQDAGTPLSVMMGSSKAVNDGRKRLATRLSLFKRTVLTDHIMCWRPTLRTFLNGLFASGYKINVWTAASKPYALEVVKALNLKSYGMGLLVTAQDYPKGSVKRLKYLTGLDAVKIPLSNTAIVDDREEVKRAQPTRAVHIKPFTASSANTACSESDELKRVTASLAIIAGRSRRR
>sp|Q91G56|042R_IIV6 Uncharacterized protein 042R OS=Invertebrate iridescent virus 6 GN=IIV6-042R PE=4 SV=1
MATLQQAQQQNNQLTQQNNQLTQQNNQLTQRVNELTRFLEDANRKIQIKENVIKSSEAENRKNLAEINRLHSENHRLIQQSTRTICQKCSMRSN
Login before adding your answer.
Traffic: 3260 users visited in the last hour
Thank you so much for your guide