Question: Why Perl Or Sed Command Not Working
0
6.1 years ago by
biolab1.2k
biolab1.2k wrote:

Hi everyone, I have a FASTA file like the one below.

>miR156a
GACAGAA
>miR156b
GACAGAA
>miR156c
GACAGAA
............

I need to format it as below.

    miR156a   GACAGAA
    miR156b   GACAGAA
    miR156c   GACAGAA
    ............

My plan was to first replace every newline with a tab, and then replace each > with a newline. For the first step, I used the command sed -e 's/\n/\t/g' IN > OUT. It didn't work. I then tried an alternative Perl command, cat IN | perl -ne 's/\n/\t/' > OUT. This time the OUT file contained nothing. What am I doing wrong? Thank you very much for your answers!

perl • 3.6k views
ADD COMMENTlink modified 6.1 years ago by Vivek Krishnakumar370 • written 6.1 years ago by biolab1.2k

Following up on my question, I tried a new Perl command, cat IN | perl -ne 'while (<>) {chomp; print "$_\t"}' > OUT, and got the following output.

GACAGAA >miR156b^M  GACAGAA >miR156c^M  GACAGAA  ......

This is probably caused by mixing Windows and Linux line endings. Could anyone give me some suggestions and comments? Thanks a lot!

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by biolab1.2k

Looks like your input file comes from Windows and you are on a *NIX machine. Try running it through dos2unix first, e.g. cat IN | dos2unix | perl ...
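
If dos2unix isn't installed, stripping the carriage returns with tr should do the same job for this purpose. A minimal sketch, assuming a two-record CRLF file built inline (the make_crlf_fasta helper and its data are made up for illustration) and using a plain awk join rather than any specific command from this thread:

```shell
# Hypothetical CRLF-terminated input, as a Windows editor might save it.
make_crlf_fasta() {
    printf '>miR156a\r\nGACAGAA\r\n>miR156b\r\nGACAGAA\r\n'
}

# tr -d '\r' deletes every carriage return (swapped in here for dos2unix).
# The awk step prints each header without its ">" followed by a tab,
# then prints the sequence line, which ends the output row.
converted=$(make_crlf_fasta \
    | tr -d '\r' \
    | awk '/^>/ { printf "%s\t", substr($0, 2); next } { print }')

printf '%s\n' "$converted"
```

Without the tr step, each ^M would stay attached to the end of the header and sequence, which is what shows up in the output pasted above.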

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by aheinzel110
8
6.1 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:
cat test.fa | sed -n '/>/ {h; N; s/>//; s/[\r\n]/\t/g; p}'

miR156a    GACAGAA
miR156b    GACAGAA
miR156c    GACAGAA

How it works:

sed -n '          # turn off default printing
 />/{             # if the line matches a sequence header
 h;               # copy it to the hold space
 N;               # append the next line to the pattern space
 s/>//;           # remove the '>' symbol
 s/[\r\n]/\t/g;   # 'g': replace every carriage return or newline with a tab
 p }              # print the result
 '
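
This also hints at why the original sed -e 's/\n/\t/g' did nothing: sed strips the trailing newline before a line reaches the pattern space, so there is no \n to match until N pulls in a second line. A small sketch of that behaviour, assuming made-up sample data (the make_fasta helper is hypothetical) and using sed's l command to show the pattern space unambiguously:

```shell
# Hypothetical two-record FASTA with Unix line endings.
make_fasta() {
    printf '>miR156a\nGACAGAA\n>miR156b\nGACAGAA\n'
}

# Without N, every pattern space is a single line with no newline in it,
# so the substitution never matches and the text passes through unchanged.
unchanged=$(make_fasta | sed 's/\n/X/g')

# With N, header and sequence share one pattern space, joined by an
# embedded newline. The l command prints that newline as a literal \n
# and marks the end of the pattern space with $.
joined=$(make_fasta | sed -n 'N; l')

printf '%s\n' "$joined"
```

Each line that l prints should show a header and its sequence with a visible \n between them, which is the character the s command in the answer turns into a tab.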
ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Pavel Senin1.9k

Nice, that's rather more concise than my awk solution!

ADD REPLYlink written 6.1 years ago by Devon Ryan94k

Thanks! I hope it'll work for the OP.

ADD REPLYlink written 6.1 years ago by Pavel Senin1.9k

And you could:

cat test.fa | sed 'h; N; s/>\(.*\)[\r\n]/\1\t/'
ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k
4
6.1 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

You're creating an extremely long line, at least if your input file is largish. That's likely screwing things up. Why not just do things in one step:

awk 'BEGIN{ORS="";OFS="";}{gsub(">","",$1); if(NR%2==0) {print "\t",$1,"\n"} else {print $1}}' foo.fa
ADD COMMENTlink written 6.1 years ago by Devon Ryan94k
5

awk '{x=substr($0,2);getline;print x"\t"$0;}' foo.fa

ADD REPLYlink written 6.1 years ago by lh332k

Nice, I guess I have a penchant for verbosity :P

ADD REPLYlink written 6.1 years ago by Devon Ryan94k

this one is cool!

ADD REPLYlink written 6.1 years ago by Pavel Senin1.9k

Thank you both! The commands work well!

ADD REPLYlink written 6.1 years ago by biolab1.2k
4
6.1 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Here's another option:

perl -pe 's/>(.+)[\r\n]/$1\t/' foo.fa

Output on your dataset:

miR156a    GACAGAA
miR156b    GACAGAA
miR156c    GACAGAA
ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k
3
6.1 years ago by
Rockville, MD
Vivek Krishnakumar370 wrote:

Since TMTOWTDI ;), here is another Perl-based method, which does not assume that the FASTA sequence is located in one single line following the header:

perl -076 -l12 -ne 'next unless /\w/; chomp; @b = split /\n/; $h = shift @b; $s = join "", @b; print "$h\t$s";' IN > OUT

Here is how it works:

-076   : Sets the input record separator ($/) to ">" (`76` in octal) so that you can iterate through chunks of FASTA sequences
-l12   : Sets the output record separator ($\) to "\n" (`12` in octal) and performs automatic line-ending processing
-n     : Specifies that the script should automatically loop through every available chunk, as delimited by the input record separator
-e     : Tells the perl interpreter that the following text is a line of perl code

next unless /\w/; -> Skips any chunk that does not contain data (which is essentially the first chunk, preceding the first occurrence of the ">" symbol)
chomp;            -> Removes any trailing input record separator from the chunk being processed
@b = split /\n/;  -> Splits the chunk into an array, at every newline char
$h = shift @b;    -> Extracts first element of array which is the FASTA header
$s = join "", @b; -> Joins the rest of the array elements into a string, which corresponds to the sequence
print "$h\t$s";   -> Prints out the header and the sequence delimited by a tab
ADD COMMENTlink written 6.1 years ago by Vivek Krishnakumar370
1

Good thought about possible multi-line sequences. Here's another option to handle that case:

perl -076 -nE 'chomp; next unless /\w/; s/(.+)\n/$1\t/; s/\n//g; say' foo.fa
ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k

thanks for the informative answer.

ADD REPLYlink written 6.1 years ago by biolab1.2k
2
6.1 years ago by
Vivek2.4k
Denmark
Vivek2.4k wrote:
awk '{if(NR % 2 == 1) printf "%s\t", substr($0,2); else print $0}' file.fa

Another variation with awk

ADD COMMENTlink written 6.1 years ago by Vivek2.4k
2

And with just a few minor changes (but none to your logic):

awk '{printf "%s", (NR%2) ? substr($0,2)"\t" : $0"\n"}' foo.fa
ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k