Question

Shell script to reverse complement DNA, but with a twist (certain bases with a long name)

0

Entering edit mode

5.7 years ago

johan ▴ 120

I use this shell script (I'm not the author) to reverse complement DNA:

alias revcomp="echo 'import sys; print \"\".join([dict(zip(\"ACGTacgt\",\"TGCAtgca\"))[c] for c in sys.argv[1][::-1]])' | python -"

In the terminal it works perfect just to write revcomp ATGC which outputs GCAT. And to just complement, I changed it to

TGCAtgc\",\"TGCAtgc.

I'm working with some files, where for reasons I'm not in charge for, some bases written as a word. So let's say that one base is called (askJoeaboutthisone), how would I complement this sequence? I.e. so that AT(askJoeaboutthisone)GC returns CG(askJoeaboutthisone)TA.

In my script each letter is converted, but I wish that (askJoeaboutthisone) is treated as one letter.

Thanks!

shellscript reverse complement • 1.9k views

ADD COMMENT • link updated 5.7 years ago by 5heikki 11k • written 5.7 years ago by johan ▴ 120

0

Entering edit mode

What you're really doing is transliterating (and then reversing the sequence). The problem is, you have to iterate the string character by character.

Is it possible for your example to use mixed case? i.e. your long strings are all lower case, but your normal bases are all upper? It'll be tricky to delineate whats a string from within a string. Are the parenthesis actually part of your 'data' or is that just for the benefit of the post/explanation?

ADD REPLY • link 5.7 years ago by Joe 21k

score 2 · Answer 1 · 2019-02-19

echo "AT(askJoeaboutthisone)GC" \
    | sed 's/(askJoeaboutthisone)/#/g' \
    | tr "[ATGCYRSWKMBDHVatgcyrswkmbdhv]" "[TACGRYSWMKVHDBtacgryswmkvhdb]" \
    | rev \
    | sed 's/#/(askJoeaboutthisone)/g'
GC(askJoeaboutthisone)AT

So, before the complementation (the tr line), we first replace (askJoeaboutthisone) with #, then after complementation and reversing (the rev line), we replace # with (askJoeaboutthisone). We do the temporary replacement so that the letters in the joestring don't get complemented by tr nor reversed by rev

Edit. You can save this as a script:

#!/bin/bash
if [ ! -z "$1" ]; then
    echo "$1" \
        | sed 's/(askJoeaboutthisone)/#/g' \
        | tr "[ATGCYRSWKMBDHVatgcyrswkmbdhv]" "[TACGRYSWMKVHDBtacgryswmkvhdb]" \
        | rev
        | 's/#/(askJoeaboutthisone)/g'
else
    echo ""
    echo "usage: rev_comp_seq_joe_twist DNASEQUENCE"
    echo ""
fi