Fast sites removal from alignment
2
0
Entering edit mode
3.4 years ago
jan • 0

Hi all,

I have a file listing site-specific substitution rates (file.rate; please see below). In several steps (à 4,000 sites), I would like to remove the fastest evolving sites (shown in column "Rate") from my AA alignment (file.ali). I would be very happy and grateful if you could help out (maybe in python, perl, awk).

Site Rate Cat C_Rate

1 2.74582 4 2.74582

2 0.31646 2 0.31554

3 0.77656 3 0.88431

4 0.08958 1 0.05433

5 ... ...

The alignment file is a normal fasta (one-liner):

>Spec1
TDKCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDKCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
????KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK

alignment sequence • 772 views
0
Entering edit mode
3.4 years ago
Hugo ▴ 340

You can use awk in first place to obtain the list of sites. For instance, the following command prints only lines where "C_Rate" (fourth column, $4 in awk) is higher or equal than 1: awk 'NR > 1 { if ($4 >= 1) print $0 }' sites.list  I would recommend you to get only the first column and append the "Spec" word and save it into a file: awk 'NR > 1 { if ($4 >= 1) print "Spec"1 }' sites.list > filtered.sites.list  Then you could go to SEDA www.sing-group.org/seda) load the filtered.sites.list in the "Pattern filtering" operation (http://www.sing-group.org/seda/manual/operations.html#pattern-filtering), and use it to filter your FASTA file. ADD COMMENT 0 Entering edit mode Thanks, but how is this removing sites from the alignment? ADD REPLY 0 Entering edit mode Ok, I understand now. How is your AA file? I think that you should use this awk filter first and then filter the remaining sites. Please, show us your AA file to see how this can be achieved. ADD REPLY 0 Entering edit mode I have updated my answer, hope this can be useful to you. ADD REPLY 0 Entering edit mode No, this is very cumbersome as I would have to find all the different thresholds to use by myself etc. Anyway, thanks for your suggestions. ADD REPLY 0 Entering edit mode 3.4 years ago If you only want remove the ONE fastest site, firstly get the site:  cat rate.tsv | csvtk sort -t -k Rate:nr | csvtk cut -t -f Site | sed -n 2p
1


Supposing it's 3rd not 1, next, retrieve sub-sequence on the left and right, and then re-concatenate them.

$seqkit subseq -r 1:2 seqs.fa > left.fa$ seqkit subseq -r 4:-1 seqs.fa > right.fa

\$ seqkit concat left.fa right.fa
>Spec1
TDCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
???KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK



If you want remove multiple sites, it's a bit completed for shell command, you'd better write a script.