Question: How To Convert A Big Fasta File With Multi-Line Dna Sequences Into A Fasta File Of Reverse Complement Sequence
3
gravatar for Hamilton
7.4 years ago by
Hamilton260
Hamilton260 wrote:

Hi,

I am trying to convert a big fasta file(as below seqID is started from 1 and upto ~20K) with multi-line DNA sequences into a fasta output format of reverse complement. how can i generate it? there are some available web tools but only one sequence or small file size is acceptable.

Any your input will be appreciated.

###
>1
GAGTTGTCCCATCCCTGGGAATGCATCAGAGGCAGGCAAGAGGATCAGATGGCAAAGTGA
TGAAATCAGAGACAAAGTGGGGTTAACTGAGAGCTCGCTCCCCCAGGCTTCTCTGACCCC
AGCTCCCTCCACTTAAAACACGGTGCCCATTGATGGCAGCTCTGTTTCATTTCCTGTTAT
TCC
>2
GCCACACTATGAGAGAGTTGCAGAGTAAAACGTCTATGTAAACATATCAGATGTATTACA
GTAATACCGTCTGTGCTTACTCAGTTCTCCTTGGCTCGGTTTTATGTACAAACTAAACTA
TTAGGATCCCTTTCATTGTTCCTTCTGGCAGCTCTGTCTAACTTCTG
>3
TTGGCAGTCATAGTGTGCGGCTGTGTCTGCCACTCCTCTGCAAACTCCCCAGCTAAGTAG
ATTGCTGTAAAGCGTGACTAATGAGGTCCTGTGAGCTGACCTAATGAAACCCAACCAGCA
GGGTCACCAGGGCCACCAAAGCGTGAGTCTGCATCTCCCTAGGCCGTGTAACCGGATCCG
CACTGGCTCTGGAGGGAGGGCCAGGAG
>4
TCCACAGGTCATCCATCACCTGGGAGGAAGCCATGCTAGCAGGTAACTGGACTCTCTGCT
CGGTGGCCCGCACACTCTCTGTCTTTCATCTCTGCCCAGCAAGCTGCGTGGGCTTTCAGT
GCCCAGTCATTAACTCTGGCCTT
>5
CAGAGTGTTAACTTACACATGTGGGCATATACATTACGCGGGGCGGATAACTCACTCCTA
ATTTACTGCTGTCTTCTTCTCCGTGTGGAGGATCTTGTTCCTGTCGACAAATTTGGGGAT
CTAATAACAAAGTAGCGGAAGGATTCATCAGGGCAGGTGTAATC
>6
GAGTAAAGTCGACCCCTCAAAGGCGAATACTCTACTGTTTACTTAACCCAACATGTAGGA
GGCACTAGCTGCCAGCAGAGCAGGGATGGAATAACCCTTGATCTGAAGGATTTACAGTGT
AGTGGGTGATACAGATATCCGCGTACACACCAGTGAGCTGACTG
>7
ATGTTTACCCGAGCCTTGAGTGGCAGGTAAATGGACGAGTGGGAGGCACGGCCAGTTTCT
GTTGTGTATCCAGTATAGAATGTCCTAGAGTGGCCAACTGTCCTCTGGATGTGTTATTGG
ATCCTTGAATTCTGATAGCTTCACCAG
>8
agtaagcagcaccctgcacggcctctgcacggcctctgcctcaggttgcaggcctgcctg
aggccctgtcctgccttcctttgaagatgacctgtcatgtggaactgagtgaaatcaatc
ctcttctccccaagctgctttgggtcatggtgttttatca

###

##

fasta dna • 12k views
ADD COMMENTlink modified 2.5 years ago by melendezmatias10 • written 7.4 years ago by Hamilton260

So I'm working on compiling a few programming exercises for people interested in bioinformatics ... and I like the solutions posted here ... I'm wondering if I could get a novice explanation on the motivation to get the reverse compliment sequence

ADD REPLYlink written 7.4 years ago by Delinquentme200
5
gravatar for Pablacious
7.4 years ago by
Pablacious610
Cambridge, UK
Pablacious610 wrote:

You could either use BioPerl or BioPython for this task, rather than directly scripting the correspondence of bases and so on. It would probably be no more than 10 lines in any of these two.

For instance, from the BioPython tutorial (just change genbank for fasta for reading a fasta:

from Bio import SeqIO
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print record.id
    print record.seq.reverse_complement()

Now, if we want to save these reverse complements to a file, we’ll need to make SeqRecord objects. For this I think its most elegant to write our own function, where we can decide how to name our new records:

from Bio.SeqRecord import SeqRecord

def make_rc_record(record):
    """Returns a new SeqRecord with the reverse complement sequence."""
    return SeqRecord(seq = record.seq.reverse_complement(), \
                 id = "rc_" + record.id, \
                 description = "reverse complement")

We can then use this to turn the input records into reverse complement records ready for output. If you don’t mind about having all the records in memory at once, then the Python map() function is a very intuitive way of doing this:

from Bio import SeqIO
records = map(make_rc_record, SeqIO.parse("ls_orchid.fasta", "fasta"))
SeqIO.write(records, "rev_comp.fasta", "fasta")

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc63

ADD COMMENTlink written 7.4 years ago by Pablacious610
2
gravatar for Neilfws
7.4 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

When your data get too large for web-based tools, it's time to learn about locally-installed tools.

See EMBOSS revseq. Don't be confused by the documentation - it handles multiple sequence files.

ADD COMMENTlink written 7.4 years ago by Neilfws48k
2
gravatar for Martin A Hansen
7.4 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Biopieces are pretty ideal for stuff like this:

read_fasta -i data.fna | reverse_seq | complement_seq | write_fasta -o data_rc.fna -x

You can alias this to revcomp_seq as explained here.

ADD COMMENTlink written 7.4 years ago by Martin A Hansen3.0k
1
gravatar for Pierre Lindenbaum
7.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:
ADD COMMENTlink written 7.4 years ago by Pierre Lindenbaum119k
0
gravatar for Damian Kao
7.4 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Here is a quick and dirty python script for reverse complement:

import sys

inFile = open(sys.argv[1],'r')

nuc = {'A':'T','T':'A','G':'C','C':'G','K':'M','M':'K','R':'Y','Y':'R','S':'W','W':'W','B':'V','V':'B','H':'G','D':'C','X':'N','N':'N'}

def revComp(seq):
    rev = ''
    for i in range(len(seq) - 1,-1,-1):
        rev += nuc[seq[i]]

    return rev

header = ''
seq = ''
for line in inFile:
    if line[0] == ">":
        if header != '':
            print header
            print revComp(seq.upper())

        header = line.strip()
        seq = ''
    else:
        seq += line.strip()

print header
print revComp(seq.upper())

Save as yourName.py. Use by: python yourName.py yourSeq.fasta > revComp.fasta

edit*

Added ambiguity codes. It's probably better to go with the BioPython implementation by pablacious. Doesn't hurt to learn how these things work though.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Damian Kao15k

Doesn't handle ambiguity codes. Perhaps too quick and dirty :o)

ADD REPLYlink written 7.4 years ago by Martin A Hansen3.0k
0
gravatar for melendezmatias
2.5 years ago by
melendezmatias10 wrote:

Solution for Command line lovers. Considering the DNA sequences in single-line format in a multifasta file:

cat multifasta_file.txt | while IFS= read L; do if [[ $L == >* ]]; then echo "$L"; else echo $L | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt

If your multifasta file is not in single-line format, you can transform your file to single-line before using the command above, like this:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' <multifasta_file.txt &gt;multifasta_file_singleline.txt<="" p="">

Then,

cat multifasta_file_SingleLine.txt | while IFS= read L; do if [[ $L == >* ]]; then echo "$L"; else echo $L | rev | tr "ATGCatgc" "TACGtacg"; fi; done > output_file.txt

Hope it is useful for someone. It took me quite some time to come out with it.

ADD COMMENTlink modified 2.5 years ago by genomax65k • written 2.5 years ago by melendezmatias10

Hi Melendezmatias, I tried your first solution and got the following errors:

-bash: unexpected argument `>' to conditional binary operator
-bash: syntax error near `>*'

Any idea of how to overcome this? Thanks, -Rob

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by roblogan630
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1213 users visited in the last hour