I would like to know a way to calculate cysteine content of more than 3% in a group of sequences. Can anybody help me?
I would like to know a way to calculate cysteine content of more than 3% in a group of sequences. Can anybody help me?
using fasta input and awk. (get the number of bases N for each sequence and the number C of cysteins, print the name if C/N > 0.03)
curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz" |\
gunzip -c | \
awk '/^>/ {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name); name=$0;N=0.0;C=0.0;next;} {N+=length($0); gsub(/[^Cc]/,"",$0); C+=length($0);} END {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name);}' |\
head -n1000 | sort -k1,1g
(...)
0.190476 >sp|P37046|3CP1_STRS9 Tricyclic peptide RP 71955 OS=Streptomyces sp. (strain SP9440) PE=1 SV=2
0.190476 >sp|P85078|3CP2_STRSQ Tricyclic peptide MS-271 OS=Streptomyces sp. PE=1 SV=1
0.212766 >sp|P84028|45C1_ANCSP Delta-ctenitoxin-Asp2e OS=Ancylometes sp. PE=1 SV=1
0.228070 >sp|Q6GZT3|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3 (isolate Goorha) GN=FV3-043R PE=4 SV=1
I do not know very well the awk language. I tried using this command line below, but it returned all sequences for me, including those without cysteine.
{if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name); name=$0;N=0.0;C=0.0;next;} {N+=length($0); gsub(/[^Cc]/,"",$0); C+=length($0);} END {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name);}' |\
head -n1000 | sort -k1,1g <my_sample.fasta > output.fasta
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I´d like how many sequences have cysteine content of more than 3% in a group of sequences.