Question: How To Write A Perl Script To Transform A Sequence In A Tabular File Into A Fasta Format
redspider19800915 wrote, 6.3 years ago:

I have sequences in tab-delimited format as follows:

GGCGGATGTAGCCACGTGGATC    35    12
AGCTGCTGTAGGGTATGGCGAGCC    1    1
TGGATAATGGACGAGTACCGCCTG    14    5
......

I need a Perl script that extracts the first column (i.e. the sequence) and writes it to an output file as follows:

>seq_1
GGCGGATGTAGCCACGTGGATC
>seq_2
AGCTGCTGTAGGGTATGGCGAGCC
>seq_3
TGGATAATGGACGAGTACCGCCTG
......

Could anybody help me write this? Thanks a lot!


What have you tried? Also, try not to create tags simply by splitting sentences into words. "to" is not a useful tag.

Reply written 6.3 years ago by Neilfws

Thanks for your suggestion.

Reply written 6.3 years ago by redspider19800915
Gabriel R. (Center for Geogenetik, Københavns Universitet) wrote, 6.3 years ago:

A one-liner in awk:

awk 'BEGIN{COUNTER=1}{print ">seq_"COUNTER"\n"$1; COUNTER++;}' your_file > output.fa

Or more simply:

awk '{print ">seq_" ++i "\n" $1}' your_file > output.fa

In awk, an uninitialized variable evaluates to 0 in a numeric context, so there is no need to declare it. You just need to pre-increment it so that numbering starts at 1.

Edit: one can also use the built-in variable NR (the number of records read so far). It avoids creating an ad hoc variable i and should be slightly faster (not tested).

awk '{print ">seq_" NR "\n" $1}' your_file > output.fa
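
As a quick sanity check, the counter variant and the NR variant can be run on the three sample records from the question and compared. This is only a sketch: the file names sample.tsv, out_counter.fa and out_nr.fa are made up for illustration, and it assumes a POSIX shell with awk and cmp available.

```shell
#!/bin/sh
# Recreate the three tab-delimited sample records from the question.
# (sample.tsv is a hypothetical file name.)
printf 'GGCGGATGTAGCCACGTGGATC\t35\t12\n' >  sample.tsv
printf 'AGCTGCTGTAGGGTATGGCGAGCC\t1\t1\n' >> sample.tsv
printf 'TGGATAATGGACGAGTACCGCCTG\t14\t5\n' >> sample.tsv

# Variant 1: ad hoc counter i, pre-incremented so numbering starts at 1.
awk '{print ">seq_" ++i "\n" $1}' sample.tsv > out_counter.fa

# Variant 2: built-in record number NR.
awk '{print ">seq_" NR "\n" $1}' sample.tsv > out_nr.fa

# cmp -s is silent and returns 0 only when the files are byte-identical.
if cmp -s out_counter.fa out_nr.fa; then
    echo "identical"
else
    echo "different"
fi
```

Both variants number records consecutively from 1, so on a single input file the two output files should match.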
Reply written 6.3 years ago by Frédéric Mahé

This wins the code golf, and it is more readable than the Perl equivalent:

perl -ne 'print ">seq_".++$i."\n".(split)[0]."\n";' your_file > output.fa
Reply written 6.3 years ago by Alastair Kerr

Thank you Alastair. Using shell commands, I came up with this:

cut -f 1 your_file | nl | sed -e 's/^\ */>seq_/' -e 's/\t/\n/' > output.fa

Does anyone know how to do it using only sed?
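
One possible sed-only approach, offered as a sketch rather than a definitive answer: it pipes two sed invocations together, using the "=" command to emit line numbers and "N" to join each number with its record. The file names sample.tsv and output_sed.fa are hypothetical, and the sketch assumes the first column itself contains no spaces or tabs.

```shell
#!/bin/sh
# Hypothetical input file holding the sample records from the question.
printf 'GGCGGATGTAGCCACGTGGATC\t35\t12\n' >  sample.tsv
printf 'AGCTGCTGTAGGGTATGGCGAGCC\t1\t1\n' >> sample.tsv
printf 'TGGATAATGGACGAGTACCGCCTG\t14\t5\n' >> sample.tsv

# First sed: "=" prints the line number on its own line before each record.
# Second sed: N joins the number line with the record line that follows, then
#   s/^/>seq_/           prefixes the number with ">seq_"
#   s/[[:blank:]].*$//   drops everything from the first tab or space onward
sed = sample.tsv | sed 'N; s/^/>seq_/; s/[[:blank:]].*$//' > output_sed.fa

cat output_sed.fa
```

The character class [[:blank:]] is used instead of \t because \t inside a sed regular expression is a GNU extension.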

Reply written 6.3 years ago by Frédéric Mahé

csiu wrote, 6.3 years ago:

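Another way is to save the script below and run it on the input file:

$ perl below-script.pl sequence-input.txt

#!/usr/bin/perl
use strict;
use warnings;

# Input file is the first command-line argument; output goes to Output.fa.
open(my $in,  '<', $ARGV[0])    or die "Cannot open $ARGV[0]: $!";
open(my $out, '>', 'Output.fa') or die "Cannot open Output.fa: $!";

while (<$in>) {
    chomp;
    my ($seq) = split /\t/;          # keep only the first tab-delimited column
    print $out ">seq_$.\n$seq\n";    # $. is the current input line number
}

close($out);
close($in);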
Nari (United States) wrote, 6.3 years ago:

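Try this:

#!/usr/bin/perl
use strict;
use warnings;

print "Enter Input File: ";
chomp(my $in = <STDIN>);
open(my $fh,  '<', $in)                     or die "Cannot open $in: $!";
open(my $out, '>', 'output_sequence.fasta') or die "Cannot open output_sequence.fasta: $!";

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*$/;          # skip blank lines
    my @word = split /\t/, $line;      # split the record on tabs
    $count++;
    $word[0] =~ s/^\s+|\s+$//g;        # trim whitespace around the sequence
    print $out ">seq_$count\n$word[0]\n";
}
close $fh;
close $out;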
Kenosis wrote, 6.3 years ago:

Here's another option:

perl -lane 'print ">seq_$.\n$F[0]"' inFile >outFile

Or as a script:

use strict;
use warnings;

while (<>) {
    print ">seq_$.\n" . (/(\S+)/)[0] . "\n";
}

Usage: perl script.pl inFile >outFile

Output of both on your dataset:

>seq_1
GGCGGATGTAGCCACGTGGATC
>seq_2
AGCTGCTGTAGGGTATGGCGAGCC
>seq_3
TGGATAATGGACGAGTACCGCCTG

Hope this helps!
