Question

Calculate cysteine content

0

Entering edit mode

7.9 years ago

casley_queiroz ▴ 20

I would like to know a way to calculate cysteine content of more than 3% in a group of sequences. Can anybody help me?

sequence • 1.6k views

ADD COMMENT • link updated 7.9 years ago by Pierre Lindenbaum 161k • written 7.9 years ago by casley_queiroz ▴ 20

0

Entering edit mode

I´d like how many sequences have cysteine content of more than 3% in a group of sequences.

ADD REPLY • link 7.9 years ago by casley_queiroz ▴ 20

GenoMax · Answer 1 · 2016-05-27

2

Entering edit mode

7.9 years ago

Pierre Lindenbaum 161k

using fasta input and awk. (get the number of bases N for each sequence and the number C of cysteins, print the name if C/N > 0.03)

curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"  |\
gunzip -c | \
awk '/^>/ {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name); name=$0;N=0.0;C=0.0;next;}  {N+=length($0); gsub(/[^Cc]/,"",$0); C+=length($0);} END {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name);}' |\
head -n1000 | sort -k1,1g

(...)
0.190476    >sp|P37046|3CP1_STRS9 Tricyclic peptide RP 71955 OS=Streptomyces sp. (strain SP9440) PE=1 SV=2
0.190476    >sp|P85078|3CP2_STRSQ Tricyclic peptide MS-271 OS=Streptomyces sp. PE=1 SV=1
0.212766    >sp|P84028|45C1_ANCSP Delta-ctenitoxin-Asp2e OS=Ancylometes sp. PE=1 SV=1
0.228070    >sp|Q6GZT3|043R_FRG3G Uncharacterized protein 043R OS=Frog virus 3 (isolate Goorha) GN=FV3-043R PE=4 SV=1

ADD COMMENT • link 7.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I do not know very well the awk language. I tried using this command line below, but it returned all sequences for me, including those without cysteine.

 {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name); name=$0;N=0.0;C=0.0;next;}  {N+=length($0); gsub(/[^Cc]/,"",$0); C+=length($0);} END {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name);}' |\
head -n1000 | sort -k1,1g <my_sample.fasta > output.fasta

ADD REPLY • link updated 7.9 years ago by GenoMax 141k • written 7.9 years ago by casley_queiroz ▴ 20

1

Entering edit mode

You only need this part

 awk '/^>/ {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name); name=$0;N=0.0;C=0.0;next;}  {N+=length($0); gsub(/[^Cc]/,"",$0); C+=length($0);} END {if(N>0.0 && C/N>0.03) printf("%f\t%s\n",C/N,name);}' <my_sample.fasta > output_file

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

Entering edit mode

Perfect!! I got it! Thank you very much.

ADD REPLY • link 7.9 years ago by casley_queiroz ▴ 20

1

Entering edit mode

Thanks go to @Pierre! Accept his answer (use the check mark found on his original answer under the upvote sign).

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

Entering edit mode

`/^>/' is missing (lines starting with '>')

ADD REPLY • link 7.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

and of, course, head -n1000 | sort -k1,1g is just here to speed-up my demo

ADD REPLY • link 7.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks Pierre! You helped me a lot.

ADD REPLY • link 7.9 years ago by casley_queiroz ▴ 20