Question: Fast sites removal from alignment
0
gravatar for jan
22 months ago by
jan0
jan0 wrote:

Hi all,

I have a file listing site-specific substitution rates (file.rate; please see below). In several steps (à 4,000 sites), I would like to remove the fastest evolving sites (shown in column "Rate") from my AA alignment (file.ali). I would be very happy and grateful if you could help out (maybe in python, perl, awk).

Site Rate Cat C_Rate

1 2.74582 4 2.74582

2 0.31646 2 0.31554

3 0.77656 3 0.88431

4 0.08958 1 0.05433

5 ... ...

The alignment file is a normal fasta (one-liner):

>Spec1
TDKCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDKCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
????KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK
sequence alignment • 484 views
ADD COMMENTlink modified 22 months ago • written 22 months ago by jan0
0
gravatar for Hugo
22 months ago by
Hugo250
Universidade de Vigo, Ourense (Spain)
Hugo250 wrote:

You can use awk in first place to obtain the list of sites. For instance, the following command prints only lines where "C_Rate" (fourth column, $4 in awk) is higher or equal than 1:

awk 'NR > 1 { if ($4 >= 1) print $0 }' sites.list

I would recommend you to get only the first column and append the "Spec" word and save it into a file:

awk 'NR > 1 { if ($4 >= 1) print "Spec"$1 }' sites.list > filtered.sites.list

Then you could go to SEDA www.sing-group.org/seda) load the filtered.sites.list in the "Pattern filtering" operation (http://www.sing-group.org/seda/manual/operations.html#pattern-filtering), and use it to filter your FASTA file.

ADD COMMENTlink modified 22 months ago by genomax80k • written 22 months ago by Hugo250

Thanks, but how is this removing sites from the alignment?

ADD REPLYlink written 22 months ago by jan0

Ok, I understand now. How is your AA file? I think that you should use this awk filter first and then filter the remaining sites. Please, show us your AA file to see how this can be achieved.

ADD REPLYlink written 22 months ago by Hugo250

I have updated my answer, hope this can be useful to you.

ADD REPLYlink written 22 months ago by Hugo250

No, this is very cumbersome as I would have to find all the different thresholds to use by myself etc. Anyway, thanks for your suggestions.

ADD REPLYlink written 22 months ago by jan0
0
gravatar for shenwei356
22 months ago by
shenwei3565.1k
China
shenwei3565.1k wrote:

If you only want remove the ONE fastest site, firstly get the site:

$ cat rate.tsv | csvtk sort -t -k Rate:nr | csvtk cut -t -f Site | sed -n 2p 
1

Supposing it's 3rd not 1, next, retrieve sub-sequence on the left and right, and then re-concatenate them.

$ seqkit subseq -r 1:2 seqs.fa  > left.fa
$ seqkit subseq -r 4:-1 seqs.fa > right.fa

$ seqkit concat left.fa right.fa 
>Spec1
TDCKPKKCHLECKKNCPIVKTGKSSKIAFISEMLCIGCGICVK
>Spec2
TDCKPK----EKKKNCPIVGTGKSSKIAFISEMLCIGCGICVK
>Spec3
???KKCHLECKKNCPIVKTGK-----IAFISEMLCIGCGICVK

If you want remove multiple sites, it's a bit completed for shell command, you'd better write a script.

ADD COMMENTlink written 22 months ago by shenwei3565.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1956 users visited in the last hour