Question: Script for concatenating (multiple fasta sequences in a single file), creating reverse complement and counting the length of final sequence
0
gravatar for adeena_hassan
3.3 years ago by
adeena_hassan50 wrote:

Assalam-o-alaikum everyone,

I have downloaded whole genome of some vertebrates from NCBI and then then fetched coordinates and sequence of genes and CDS from whole genome subsequently.

Now I have CDS file like below:

$ cat input file:

>chr16:7161867-7161955
TTTAGTTTCCCCATCATCCCAGAAAAGTTCGCCTTTTGCTTCTTTGTTTTCATCCAGGGCAATGATAAGACCAAGGGGGTT
>chr16:7161148-7161197
TGGGTGACAGAAAACTCACATAAGAGATAAACTTGATTGGCCACAGTAT
>chr16:7160414-7160581
TGTAGGTTAGAATCATAAGTGACATTAGGAGAAGTTTGACTTGGGATACCATTGTGTTTCACTATAACATTGGTAGGTTCC
>chr16:7157321-7157473
TCCCAGGCACAGCCACGGGCAGTACAGTTTGCTGCNGAAGCACCGCTCTCATCAGGAAAACAGTCTATTTTCTCTTCATC

I want to concatenate these parts of CDS to a single sequence then making its reverse complement and counting length of the final sequence. P.S all parts of CDS in a single file

Expected output:

>Dog
TTTAGTTTCCCCATCATCCCAGAAAAGTTCGCCTTTTGCTTCTTTGTTTTCATCCAGGGCAATGATAAGACCAAGGGGGTTTCACATAAGAGATAAACTTGATTGGCCACAGTATTGTAGGTTAGAATCATAAGTGACATTAGGAGAAGTTTGACTTGGGAATAACATTGGTAGGTTCCTCCCAGGCACAGCCACGGGCAGTACAGTTTGCTGCNGAAGCACCGCTCTCATCAGGAAAACAGTCTATTTTCTCTTC

I have tried following script but its not give my desired output (its not give concatenating sequence against a single fasta headers).

open my $fh,"<","filename.text" or die"error opening $!";

$/ = ">";

<$fh>;

while (<$fh>)
{
    my ($header,@ar) = split("\n",$_);

    my $entry =join("\n",@ar);

    $entry = reverse $entry;

    $entry =~ tr/ACGUacgu/UGCAugca/;

    print ">$header\n$entry\n\n";
}

Any other way to do so ??

ADD COMMENTlink modified 3.3 years ago by Brian Bushnell17k • written 3.3 years ago by adeena_hassan50
2

little late but I want to show you this on Perl:

#!/usr/bin/perl

use strict;
use warnings;

open my $fh,"<","filename.text" or die"error opening $!";

$/ = "\n>"; # better split by the new line and ">"

my $seq = '';

while (<$fh>)
{
    s/>//g;
    my ($header, @seq) = split("\n", $_);
    my $entry = reverse(join("", @seq));
    $entry =~ tr/ACGUacgu/UGCAugca/;
    $seq .= $entry; # concat the new sequence
}

print ">DOG\n$seq\n";
ADD REPLYlink written 3.3 years ago by JC12k
6
gravatar for st.ph.n
3.3 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:
#!/usr/bin/env python

import sys
from Bio.Seq import Seq

with open(sys.argv[1], 'r') as f:
    seq = ''
    for line in f:
        if not line.startswith(">"):
            seq += line.strip()

rc = str(Seq(seq).reverse_complement())

print '>Dog ' + str(len(rc)) + '\n'  + rc

Output (single-line seq):

>Dog 291
GATGAAGAGAAAATAGACTGTTTTCCTGATGAGAGCGGTGCTTCNGCAGCAAACTGTACTGCCCGTGGCTGTGCCTGGGAGGAACCTACCAATGTTATAGTGAAACACAATGGTATCCCAAGTCAAACTTCTCCTAATGTCACTTATGATTCTAACCTACAATACTGTGGCCAATCAAGTTTATCTCTTATGTGAGTTTTCTGTCACCCAAACCCCCTTGGTCTTATCATTGCCCTGGATGAAAACAAAGAAGCAAAAGGCGAACTTTTCTGGGATGATGGGGAAACTAAA
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by st.ph.n2.5k

Hi st.ph.n, I have used your code but it gives following error.

Error:

Traceback (most recent call last):
  File "./reverse_complement.py", line 6, in <module>
    with open(sys.argv[1], 'r') as f:Dog_MGAM_CDS.fa
IndexError: list index out of range

IndentationError: expected an indented block

I don't know much about python could you tell me what to do with it.

ADD REPLYlink written 3.3 years ago by adeena_hassan50
1

You need to provide proper command line arguments.

Python code.py fasta_file
ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Renesh1.9k

Thank u so much its working :)

ADD REPLYlink written 3.3 years ago by adeena_hassan50

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 3.3 years ago by WouterDeCoster45k
1

I forgot to mention, sys.argv[1] is the name of the input file. So, if you saved the script as reverse_complement.py, run as python reverse_complement.py Dog_MGAM_CDS.fa

ADD REPLYlink written 3.3 years ago by st.ph.n2.5k
2
gravatar for Renesh
3.3 years ago by
Renesh1.9k
United States
Renesh1.9k wrote:

You can use python,

from Bio import SeqIO
from Bio.Seq import Seq
import argparse

parser = argparse.ArgumentParser(description="Concatenate multiple sequences")
parser.add_argument('-f', action='store', dest='fasta_file', help='Input fasta file')
result = parser.parse_args()
con_seq = ""

for rec in SeqIO.parse(result.fasta_file, "fasta"):
    con_seq += rec.seq

rc_con_seq = Seq(str(con_seq)).reverse_complement()
print rc_con_seq, len(rc_con_seq)

Save above code in code.py file and run as,

python code.py -f fasta_file

Note: You need to have biopython for running this code

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Renesh1.9k
1

Note: there is no need for manual opening of the fasta file, you can just pass the filename to SeqIO.parse()

ADD REPLYlink written 3.3 years ago by WouterDeCoster45k

Thanks for useful information.

ADD REPLYlink written 3.3 years ago by Renesh1.9k
0
gravatar for Brian Bushnell
3.3 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

With the BBMap package:

fuse.sh in=input.fa out=x.fa
reformat.sh in=x.fa out=result.fa rcomp

That will also print the number of bases to the screen.

ADD COMMENTlink written 3.3 years ago by Brian Bushnell17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1128 users visited in the last hour
_