Question: fasta - reverse complement sequence
2
gravatar for mbk0asis
2.1 years ago by
mbk0asis320
Korea, Republic Of
mbk0asis320 wrote:

Hi.

I have a fasta file with multiple headers and want to get reverse complement sequences. I usually use FASTX-TOOLKIT, but I want to learn how to do with linux commands.

I tried it using 'awk' and 'if' something like below, but I got a tedious result.

awk '{if($1 ~ />/) print $0; else gsub("[ATCG]","[TAGC]");print $0}' TEST.fa

As you can see, the file has the same string in the header.

How can I get reverse complement sequences without altering the headers?

> ATCG_1
agtcgacATCGATtataggta
> ATCG_2
cgactgcagtcgacATCGACT

Thank you!

bash reverse complement fasta • 3.5k views
ADD COMMENTlink modified 7 months ago by dugong0 • written 2.1 years ago by mbk0asis320
2

This is the kind of thing one shouldn't bother doing with awk or similar tools. Sure, you can come up with a solution, but why bother?

ADD REPLYlink written 2.1 years ago by Devon Ryan79k

I have this script for quick rev comp, but it assumes pure sequence as input (no headers).

#!/bin/bash
if [ ! -z "$1" ]; then
    echo "$1" | tr "[ATGCatgc]" "[TACGtacg]" | rev
else
    echo ""
    echo "usage: rev_comp_seq DNASEQUENCE"
    echo ""
fi
ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by 5heikki7.2k
9
gravatar for Pierre Lindenbaum
2.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum107k wrote:

using 'tr', and 'rev', 2 lines per sequence

 cat input.fa | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done
ADD COMMENTlink written 2.1 years ago by Pierre Lindenbaum107k

Thank you, Pierre! It works well. What exactly is the difference between $L and "$L"? How does it work?

ADD REPLYlink written 2.1 years ago by mbk0asis320

$L and "$L" there is no difference while read L: the programs reads one line (the header), we print the line; we read the very next line into L; we print the last L, we reverse and translate it.

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum107k

Thank you! I understood!

ADD REPLYlink written 2.1 years ago by mbk0asis320

Thank! But I also have problem: that command can't work for sequencs with several lines.

ADD REPLYlink written 15 months ago by yangjie45460

linearize the sequence and make a 'two line' fasta

ADD REPLYlink written 15 months ago by Pierre Lindenbaum107k

Why does the second 'read L' read the next line? I don't see where the input is coming from.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by q.muncaster0
3
gravatar for venu
2.1 years ago by
venu5.1k
Germany
venu5.1k wrote:

Assuming 2 lines per sequence (header and sequence), if not linearize and try something like following or modify more to get how you need

cat fasta.fa | paste - - | awk -F'\t' -vOFS='\t' '{gsub("A", "T", $2); gsub("T", "A", $2); gsub("G", "C", $2); gsub("C", "G", $2); print}' | tr '\t' '\n'
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by venu5.1k

Thank you for the answer! However, the code didn't work because gsubs serially change the sequences. Firstly. all A's are turned to T's, then second 'gsub' change them back to A's. So finally, all the sequences become a pool of A's and G's. Can I use gsub to change multiple strings at once?

ADD REPLYlink written 2.1 years ago by mbk0asis320
1
gravatar for iraun
2.1 years ago by
iraun3.3k
Norway
iraun3.3k wrote:

Perl?

perl -nle'BEGIN {
    @map{ A, a, C, c, G, g, T, t } = ( T, t, G, g, C, c, A, a )
    }
    print /^>/ ?
        $_ :
              join //, map $map{ $_ }, split //, scalar reverse
    ' file.fa
ADD COMMENTlink written 2.1 years ago by iraun3.3k
0
gravatar for chen
2.1 years ago by
chen1.6k
OpenGene
chen1.6k wrote:

Try OpenGene (https://github.com/OpenGene/OpenGene.jl) to get reverse complement very easily with an operator ~

julia> using OpenGene

julia> seq = dna("AAATTTCCCGGGATCGATCGATCG")
dna:AAATTTCCCGGGATCGATCGATCG

julia> ~seq
dna:CGATCGATCGATCCCGGGAAATTT

ADD COMMENTlink written 2.1 years ago by chen1.6k
0
gravatar for dugong
7 months ago by
dugong0
EU
dugong0 wrote:

Based on Pierre's answer here is a working solution for multi-line fasta.

Changes:

  • You do not lose undetermined sequences "N"-s
  • Works on multi line fasta: it escapes transformation on lines starting with '>'

As a 'one'-liner:

cat play.fa | while read L; do if [[ $L =~ ^'>' ]]; then echo $L; else echo $L | rev | tr "ATGC" "TACG" ; fi ; done

As a bash function

function _revcomplement.file { cat $1 | while read L; do if [[ $L =~ ^'>' ]]; then echo $L; else echo $L | rev | tr "ATGC" "TACG" ; fi ; done } ;

Call:

revcomplement.file play.fa

Tried on play.fa:

>MACHU
AGTCACCTTTACCCGGTTTCANNN
AGTCGCCTTTACCCGGTTTCA
CCCCCGGGGGGGGGGGGGGGG
AGTCACCTTTACCCGGTTTCA
AGTCACCTTTACCCGGTTTCA
AGTCGCCTTTACCCGGTTTCA
>PICCHU
ACTGCAGACACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCGCGGCGCGGGGCGGGGCCTGAGTCCCTGTGAGCTGGGAATCTGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGAGAGAGAGAGAGCGCCATGTGTG
ACTGCAGACACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCGCGGCGCGGGGCGGGGCCTGAGTCCCTGTGAGCTGGGAATCTGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAGAGAGAGAGAGAGAGAGACAGAGAGACAGAGAGAGAGAGAGCGCCATCTGTGAGCATTTAGAATCCTCTCTATCCTGAGCAAGGA
AGTCGCCTTTACCCGGTTTCA

The .fa file has to end with a newline, otherwise the last line is not processed!

ADD COMMENTlink modified 7 months ago • written 7 months ago by dugong0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1588 users visited in the last hour