Question: Fast sites removal from alignment
0
gravatar for jan
14 months ago by
jan0
jan0 wrote:

Hi all,

I have a file listing site-specific substitution rates (file.rate; please see below). In several steps (à 4,000 sites), I would like to remove the fastest evolving sites (shown in column "Rate") from my AA alignment (file.ali). I would be very happy and grateful if you could help out (maybe in python, perl, awk).

Site Rate Cat C_Rate

1 2.74582 4 2.74582

2 0.31646 2 0.31554

3 0.77656 3 0.88431

4 0.08958 1 0.05433

5 ... ...

The alignment file is a normal fasta (one-liner):

>Spec1
TDKCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDKCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
????KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK
sequence alignment • 417 views
ADD COMMENTlink modified 14 months ago • written 14 months ago by jan0
0
gravatar for Hugo
14 months ago by
Hugo150
Universidade de Vigo, Ourense (Spain)
Hugo150 wrote:

You can use awk in first place to obtain the list of sites. For instance, the following command prints only lines where "C_Rate" (fourth column, $4 in awk) is higher or equal than 1:

awk 'NR > 1 { if ($4 >= 1) print $0 }' sites.list

I would recommend you to get only the first column and append the "Spec" word and save it into a file:

awk 'NR > 1 { if ($4 >= 1) print "Spec"$1 }' sites.list > filtered.sites.list

Then you could go to SEDA www.sing-group.org/seda) load the filtered.sites.list in the "Pattern filtering" operation (http://www.sing-group.org/seda/manual/operations.html#pattern-filtering), and use it to filter your FASTA file.

ADD COMMENTlink modified 14 months ago by genomax70k • written 14 months ago by Hugo150

Thanks, but how is this removing sites from the alignment?

ADD REPLYlink written 14 months ago by jan0

Ok, I understand now. How is your AA file? I think that you should use this awk filter first and then filter the remaining sites. Please, show us your AA file to see how this can be achieved.

ADD REPLYlink written 14 months ago by Hugo150

I have updated my answer, hope this can be useful to you.

ADD REPLYlink written 14 months ago by Hugo150

No, this is very cumbersome as I would have to find all the different thresholds to use by myself etc. Anyway, thanks for your suggestions.

ADD REPLYlink written 14 months ago by jan0
0
gravatar for shenwei356
14 months ago by
shenwei3564.7k
China
shenwei3564.7k wrote:

If you only want remove the ONE fastest site, firstly get the site:

$ cat rate.tsv | csvtk sort -t -k Rate:nr | csvtk cut -t -f Site | sed -n 2p 
1

Supposing it's 3rd not 1, next, retrieve sub-sequence on the left and right, and then re-concatenate them.

$ seqkit subseq -r 1:2 seqs.fa  > left.fa
$ seqkit subseq -r 4:-1 seqs.fa > right.fa

$ seqkit concat left.fa right.fa 
>Spec1
TDCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
???KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK

If you want remove multiple sites, it's a bit completed for shell command, you'd better write a script.

ADD COMMENTlink written 14 months ago by shenwei3564.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1599 users visited in the last hour