Sort the fasta sequences based on the desired amino acid content
2.1 years ago
Pradeep • 0

Dear all, I have a FASTA file containing multiple protein sequences. I want to sort those sequences by how many of the desired amino acids each one contains. How can I do that?

For example, I want the following sequences sorted in descending order of their D and G amino acid content:

>FastaA
ASDFGHILMNV
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaC
FQRRYVVWILAVSRHIVFLEN
>FastaD
LAPKDYKLELDDGSDVMK

Output file:

>FastaD
LAPKDYKLELDDGSDVMK
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaA
ASDFGHILMNV
>FastaC
FQRRYVVWILAVSRHIVFLEN
fasta protein SeqIO biopython • 717 views

What exactly do you mean by "amount of desired amino acids"? In any case, if you can code in python, you can easily sort strings (sequences) by a custom function using the sort(key=your_function) syntax.
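As a rough sketch of that idea, assuming "amount" means the raw count of D and G residues (not a fraction of the length) and that ties should go to the longer sequence: the names `read_fasta`, `residue_count`, and `sort_by_content` below are made up for illustration and are not from any library. If Biopython is installed, `Bio.SeqIO.parse` could replace the small parser.

```python
# Example input copied from the question.
EXAMPLE = """>FastaA
ASDFGHILMNV
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaC
FQRRYVVWILAVSRHIVFLEN
>FastaD
LAPKDYKLELDDGSDVMK
"""


def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def residue_count(seq, residues="DG"):
    """Count how many residues in seq belong to the wanted set (assumed: raw count)."""
    wanted = set(residues.upper())
    return sum(1 for aa in seq.upper() if aa in wanted)


def sort_by_content(records, residues="DG"):
    """Sort descending by residue count; ties go to the longer sequence (an assumption)."""
    return sorted(
        records,
        key=lambda rec: (residue_count(rec[1], residues), len(rec[1])),
        reverse=True,
    )


if __name__ == "__main__":
    for header, seq in sort_by_content(read_fasta(EXAMPLE.splitlines())):
        print(header)
        print(seq)
```

On the four example sequences this gives the order FastaD, FastaB, FastaA, FastaC. To sort by fraction instead of count, the key could divide the count by `len(rec[1])`.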


Assuming that each sequence is on a single line:

$ awk -v OFS="\t" '/^>/ {getline seq} {print $0, seq, length(seq),gsub("[DG]","",seq)}' test.fa   |  sort -k4rV,4 -k3rV,3 | awk '{print $1"\n"$2}'    

>FastaD
LAPKDYKLELDDGSDVMK
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaA
ASDFGHILMNV
>FastaC
FQRRYVVWILAVSRHIVFLEN

If there is a tie, the longer sequence is printed first.

2.1 years ago

Linearize the FASTA, add a column with the number of matching characters, sort on that column, then restore the FASTA format:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa |\
awk -F '\t' '{S=$2;gsub(/[^DG]/,"",S);printf("%d\t%s\t%s\n",length(S),$1,$2);}' |\
sort -t $'\t' -k1,1nr |\
cut -f 2- |\
tr "\t" "\n"

>FastaD
LAPKDYKLELDDGSDVMK
>FastaA
ASDFGHILMNV
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaC
FQRRYVVWILAVSRHIVFLEN