Question

Sort the fasta sequences based on the desired amino acid content

2.1 years ago

Pradeep • 0

Dear all, I have a FASTA file containing multiple protein sequences. I want to sort those protein sequences by their content of specific amino acids. How can I do that?

For example, I want the following sequences sorted in descending order of their combined D and G content:

>FastaA
ASDFGHILMNV
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaC
FQRRYVVWILAVSRHIVFLEN
>FastaD
LAPKDYKLELDDGSDVMK

Output file:

>FastaD
LAPKDYKLELDDGSDVMK
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaA
ASDFGHILMNV
>FastaC
FQRRYVVWILAVSRHIVFLEN

fasta protein SeqIO biopython • 717 views

updated 2.1 years ago by cpad0112 21k • written 2.1 years ago by Pradeep • 0

What exactly do you mean by "amount of desired amino acids"? In any case, if you can code in Python, you can easily sort strings (sequences) with a custom function using the sort(key=your_function) syntax.
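
A minimal sketch of that key-function approach, using only the standard library and the four sequences from the question. The names `records` and `dg_key` are illustrative, and breaking ties by sequence length (longer first) is an assumption chosen so that the result matches the order in the question's expected output.

```python
# Example records from the question, as (name, sequence) pairs.
records = [
    ("FastaA", "ASDFGHILMNV"),
    ("FastaB", "SKSYGLKQAPPDTITLIAAKSNS"),
    ("FastaC", "FQRRYVVWILAVSRHIVFLEN"),
    ("FastaD", "LAPKDYKLELDDGSDVMK"),
]


def dg_key(record, residues="DG"):
    """Return (number of target residues, sequence length) as the sort key."""
    _, seq = record
    count = sum(seq.count(residue) for residue in residues)
    # Assumption: ties on the count are broken by length, longer first.
    return (count, len(seq))


# reverse=True sorts from the highest D+G count to the lowest.
ranked = sorted(records, key=dg_key, reverse=True)

for name, seq in ranked:
    print(f">{name}\n{seq}")
```

With this data the records come out as FastaD, FastaB, FastaA, FastaC. To rank by a fraction rather than a raw count, the first element of the key could be divided by `len(seq)`.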

2.1 years ago by liorglic ★ 1.4k

Assuming each sequence is on a single line:

$ awk -v OFS="\t" '/^>/ {getline seq} {print $0, seq, length(seq),gsub("[DG]","",seq)}' test.fa   |  sort -k4rV,4 -k3rV,3 | awk '{print $1"\n"$2}'    

>FastaD
LAPKDYKLELDDGSDVMK
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaA
ASDFGHILMNV
>FastaC
FQRRYVVWILAVSRHIVFLEN

If there is a tie, the longer sequence is printed first.

2.1 years ago by cpad0112 21k

score 0 · Answer 1 · 2022-04-14

Linearize the FASTA, prepend a column holding the number of D and G residues in each sequence, sort on that column, then restore the FASTA layout. The sort below is ascending; for descending order, change the sort key from -k1,1n to -k1,1nr.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa |\
awk -F '\t' '{S=$2;gsub(/[^DG]/,"",S);printf("%d\t%s\t%s\n",length(S),$1,$2);}' |\
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n"

>FastaC
FQRRYVVWILAVSRHIVFLEN
>FastaA
ASDFGHILMNV
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaD
LAPKDYKLELDDGSDVMK
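
The same linearize, count, sort, restore steps can be sketched in plain Python for FASTA files whose sequences wrap over several lines. This is an illustrative sketch, not part of the answer above: `read_fasta`, `sort_by_residues`, and the `example` text (the question's sequences, with FastaA wrapped across two lines) are hypothetical names and data.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_residues(text, residues="DG", descending=False):
    """Return FASTA text ordered by the count of the given residues."""
    # Decorate each record with its residue count.
    decorated = [
        (sum(seq.count(r) for r in residues), header, seq)
        for header, seq in read_fasta(text)
    ]
    # Sort on the count only; Python's sort is stable, so tied records
    # keep their input order.
    decorated.sort(key=lambda row: row[0], reverse=descending)
    # Drop the count and restore one-line-per-sequence FASTA.
    return "\n".join(f"{header}\n{seq}" for _, header, seq in decorated)


example = """>FastaA
ASDFG
HILMNV
>FastaB
SKSYGLKQAPPDTITLIAAKSNS
>FastaC
FQRRYVVWILAVSRHIVFLEN
>FastaD
LAPKDYKLELDDGSDVMK
"""

print(sort_by_residues(example))
```

Called with the defaults this sorts in ascending order (FastaC, FastaA, FastaB, FastaD); passing `descending=True` puts FastaD first, and tied records keep their input order.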