Question

Looking For A Script To Reformat A Biological Data File

1

11.0 years ago

2011101101 ▴ 110

An example of the input and the expected output format is below. Perhaps it's not a bioinformatics question, but how can I convert the file with Perl or awk?

miRNA_name    wsb1    wsb2    sm1    sm2    ph1    ph2    my1    my2
TCCCCACGGTCGGCGCCA    86    74    44    34    76    220    64    70
CGATTCCCCAGCGGAGTCGCCA    114    148    82    74    136    560    88    136
TTAGATGACCATCAGCAAACA    146    102    50    104    120    452    288    238
CTAATAGGGAACGTGAGCT    26    4    6    8    8    24    2    8
GATGCTTTGTGATTTGTGAATGCC    10    26    22    16    36    86    44    44

The expected result is:

TCCCCACGGTCGGCGCCA
wsb1    86
wsb2    74
sm1    44
sm2    34
ph1    76
ph2    220
my1    64
my2    70
CGATTCCCCAGCGGAGTCGCCA
wsb1    114
wsb2    148
sm1    82
sm2    74
ph1    136
ph2    560
my1    88
my2    136

.....

format perl awk • 2.7k views

ADD COMMENT • link updated 11.0 years ago by Kenosis ★ 1.3k • written 11.0 years ago by 2011101101 ▴ 110

3

This is not really a bioinformatics question, but a simple Perl text-processing question. Someone might help you if you provide a little more information on the formats; the conversion is in fact very easy using Perl or awk. What software requires and produces these formats? Also, it is impossible to see exactly what the formatting is from your example, because the delimiters in the output are invisible (tabs or spaces, and how many?). If you provide this information, the question might be of some value to others searching for the same format conversion; otherwise it is not relevant and should be closed.

ADD REPLY • link 11.0 years ago by Michael 54k

0

Yes, the format was not clear, so I have edited it. I have also found the answer myself using awk; it was very easy. Thank you all the same. How do I close the question?

ADD REPLY • link 11.0 years ago by 2011101101 ▴ 110

3

Perhaps you would like to post your solution, so others may learn from it?

ADD REPLY • link 11.0 years ago by Neilfws 49k

0

I found your formatting and implicit specs clear and understandable. I see no good reason to close your question.

ADD REPLY • link 11.0 years ago by Kenosis ★ 1.3k

score 6 · Answer 1 · 2013-04-17

It is not entirely clear to me what you want to do with the file you have. But assuming you want to reshape the data shown at the top of your question in the way you specified, then use R:

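# read the tab-delimited table; the first row holds the column names
d <- read.table("yourFile.txt", sep="\t", header=TRUE)

library(reshape) # if it is not installed, run install.packages("reshape") first

# long format: one row per miRNA/sample pair (columns miRNA_name, variable, value)
d <- melt(d, id.vars="miRNA_name")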
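Since the question also asked about awk, a comparable long-format table can be sketched there as well. This is only a minimal sketch: the function name `melt_counts` is hypothetical, the `variable` and `value` column names are chosen to mirror melt's defaults, and the rows come out grouped by miRNA in input order rather than stacked by sample column.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print one
# "miRNA<TAB>sample<TAB>count" row per cell of the input table.
melt_counts() {
    awk '
        BEGIN { OFS = "\t" }
        # header row: remember the sample names, then print the new header
        NR == 1 {
            for (i = 2; i <= NF; i++) header[i] = $i
            print $1, "variable", "value"
            next
        }
        NF == 0 { next }    # skip blank lines
        # data rows: one output row per sample column
        { for (i = 2; i <= NF; i++) print $1, header[i], $i }
    ' "$@"
}
```

It could be called as `melt_counts yourFile.txt > long.txt`; with no file argument it reads standard input.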

score 5 · Answer 2 · 2013-04-17

Here's another Perl option:

use strict;
use warnings;

my @header = split ' ', <>;

while (<>) {
    my @line = /(\S+)/g;
    print "$line[0]\n";
    print "$header[$_]\t$line[$_]\n" for 1 .. @line - 1;
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs the output to a file.

Sample output on your dataset:

TCCCCACGGTCGGCGCCA
wsb1    86
wsb2    74
sm1    44
sm2    34
ph1    76
ph2    220
my1    64
my2    70
CGATTCCCCAGCGGAGTCGCCA
wsb1    114
wsb2    148
sm1    82
sm2    74
ph1    136
ph2    560
my1    88
my2    136
TTAGATGACCATCAGCAAACA
wsb1    146
wsb2    102
sm1    50
sm2    104
ph1    120
ph2    452
my1    288
my2    238
...

The header line is first split into @header. Each subsequent line of the file is then split into the array @line, where $line[0] holds the sequence, which is printed as a heading; beneath it, elements $header[1 .. n] are printed alongside the corresponding elements $line[1 .. n].
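The same header-then-rows logic can also be sketched in awk, which the question mentioned as an alternative. This is a rough equivalent under the assumption that fields are separated by tabs or spaces; the function name `reshape_counts` is hypothetical and is not the asker's own (unposted) solution.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print each sequence on its own
# line, followed by one "sample<TAB>count" line per data column.
reshape_counts() {
    awk '
        # header row: remember the sample name of every column
        NR == 1 { for (i = 2; i <= NF; i++) header[i] = $i; next }
        NF == 0 { next }    # skip blank lines
        {
            print $1                                            # the sequence
            for (i = 2; i <= NF; i++) print header[i] "\t" $i   # sample, count
        }
    ' "$@"
}
```

It could be used as `reshape_counts inFile > outFile`; with no file argument it reads standard input.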

Hope this helps!

score 2 · Answer 3 · 2013-04-17

This can be done using a Perl hash of hashes, assuming a tab-delimited data file:

use strict;
use warnings;

my %hoh;
my $datafile = "data.txt";
my @g_array  = qw/wsb1 wsb2 sm1 sm2 ph1 ph2 my1 my2/;

# use "or" rather than "||" so that die fires when open itself fails
open(my $fh, '<', $datafile) or die "Cannot open $datafile: $!";
my $header = <$fh>;    # read and discard the header line
while (<$fh>) {
    chomp;
    my @fields = split /\t/;
    next unless @fields;    # skip blank lines
    # hash slice: pair each sample name with the value in its column
    @{ $hoh{ $fields[0] } }{@g_array} = @fields[ 1 .. @g_array ];
}
close($fh) or die "Cannot close $datafile: $!";

# note: sorting the keys prints the sequences alphabetically, not in input order
for my $rna ( sort keys %hoh ) {
    print $rna, "\n";
    print $_, "\t", $hoh{$rna}{$_}, "\n" for @g_array;
}