Script for concatenating (multiple fasta sequences in a single file), creating reverse complement and counting the length of final sequence
3
0
Entering edit mode
7.1 years ago

Assalam-o-alaikum everyone,

I have downloaded whole genome of some vertebrates from NCBI and then then fetched coordinates and sequence of genes and CDS from whole genome subsequently.

Now I have CDS file like below:

$ cat input file:

>chr16:7161867-7161955
TTTAGTTTCCCCATCATCCCAGAAAAGTTCGCCTTTTGCTTCTTTGTTTTCATCCAGGGCAATGATAAGACCAAGGGGGTT
>chr16:7161148-7161197
TGGGTGACAGAAAACTCACATAAGAGATAAACTTGATTGGCCACAGTAT
>chr16:7160414-7160581
TGTAGGTTAGAATCATAAGTGACATTAGGAGAAGTTTGACTTGGGATACCATTGTGTTTCACTATAACATTGGTAGGTTCC
>chr16:7157321-7157473
TCCCAGGCACAGCCACGGGCAGTACAGTTTGCTGCNGAAGCACCGCTCTCATCAGGAAAACAGTCTATTTTCTCTTCATC

I want to concatenate these parts of CDS to a single sequence then making its reverse complement and counting length of the final sequence. P.S all parts of CDS in a single file

Expected output:

>Dog
TTTAGTTTCCCCATCATCCCAGAAAAGTTCGCCTTTTGCTTCTTTGTTTTCATCCAGGGCAATGATAAGACCAAGGGGGTTTCACATAAGAGATAAACTTGATTGGCCACAGTATTGTAGGTTAGAATCATAAGTGACATTAGGAGAAGTTTGACTTGGGAATAACATTGGTAGGTTCCTCCCAGGCACAGCCACGGGCAGTACAGTTTGCTGCNGAAGCACCGCTCTCATCAGGAAAACAGTCTATTTTCTCTTC

I have tried following script but its not give my desired output (its not give concatenating sequence against a single fasta headers).

open my $fh,"<","filename.text" or die"error opening $!";

$/ = ">";

<$fh>;

while (<$fh>)
{
    my ($header,@ar) = split("\n",$_);

    my $entry =join("\n",@ar);

    $entry = reverse $entry;

    $entry =~ tr/ACGUacgu/UGCAugca/;

    print ">$header\n$entry\n\n";
}

Any other way to do so ??

reverse-complement concatenate • 4.1k views
ADD COMMENT
2
Entering edit mode

little late but I want to show you this on Perl:

#!/usr/bin/perl

use strict;
use warnings;

open my $fh,"<","filename.text" or die"error opening $!";

$/ = "\n>"; # better split by the new line and ">"

my $seq = '';

while (<$fh>)
{
    s/>//g;
    my ($header, @seq) = split("\n", $_);
    my $entry = reverse(join("", @seq));
    $entry =~ tr/ACGUacgu/UGCAugca/;
    $seq .= $entry; # concat the new sequence
}

print ">DOG\n$seq\n";
ADD REPLY
6
Entering edit mode
7.1 years ago
st.ph.n ★ 2.7k
#!/usr/bin/env python

import sys
from Bio.Seq import Seq

with open(sys.argv[1], 'r') as f:
    seq = ''
    for line in f:
        if not line.startswith(">"):
            seq += line.strip()

rc = str(Seq(seq).reverse_complement())

print '>Dog ' + str(len(rc)) + '\n'  + rc

Output (single-line seq):

>Dog 291
GATGAAGAGAAAATAGACTGTTTTCCTGATGAGAGCGGTGCTTCNGCAGCAAACTGTACTGCCCGTGGCTGTGCCTGGGAGGAACCTACCAATGTTATAGTGAAACACAATGGTATCCCAAGTCAAACTTCTCCTAATGTCACTTATGATTCTAACCTACAATACTGTGGCCAATCAAGTTTATCTCTTATGTGAGTTTTCTGTCACCCAAACCCCCTTGGTCTTATCATTGCCCTGGATGAAAACAAAGAAGCAAAAGGCGAACTTTTCTGGGATGATGGGGAAACTAAA
ADD COMMENT
0
Entering edit mode

Hi st.ph.n, I have used your code but it gives following error.

Error:

Traceback (most recent call last):
  File "./reverse_complement.py", line 6, in <module>
    with open(sys.argv[1], 'r') as f:Dog_MGAM_CDS.fa
IndexError: list index out of range

IndentationError: expected an indented block

I don't know much about python could you tell me what to do with it.

ADD REPLY
1
Entering edit mode

You need to provide proper command line arguments.

Python code.py fasta_file
ADD REPLY
0
Entering edit mode

Thank u so much its working :)

ADD REPLY
0
Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY
1
Entering edit mode

I forgot to mention, sys.argv[1] is the name of the input file. So, if you saved the script as reverse_complement.py, run as python reverse_complement.py Dog_MGAM_CDS.fa

ADD REPLY
2
Entering edit mode
7.1 years ago
Renesh ★ 2.2k

You can use python,

from Bio import SeqIO
from Bio.Seq import Seq
import argparse

parser = argparse.ArgumentParser(description="Concatenate multiple sequences")
parser.add_argument('-f', action='store', dest='fasta_file', help='Input fasta file')
result = parser.parse_args()
con_seq = ""

for rec in SeqIO.parse(result.fasta_file, "fasta"):
    con_seq += rec.seq

rc_con_seq = Seq(str(con_seq)).reverse_complement()
print rc_con_seq, len(rc_con_seq)

Save above code in code.py file and run as,

python code.py -f fasta_file

Note: You need to have biopython for running this code

ADD COMMENT
1
Entering edit mode

Note: there is no need for manual opening of the fasta file, you can just pass the filename to SeqIO.parse()

ADD REPLY
0
Entering edit mode

Thanks for useful information.

ADD REPLY
0
Entering edit mode
7.1 years ago

With the BBMap package:

fuse.sh in=input.fa out=x.fa
reformat.sh in=x.fa out=result.fa rcomp

That will also print the number of bases to the screen.

ADD COMMENT

Login before adding your answer.

Traffic: 844 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6