Question: Converting A File With Rows And Columns To Just Columns
6.4 years ago, mphillips6789 (United States) wrote:

I have a file with entries that look like this:

Pos/Line    p
148

A    0
C    0
G    0.081985
T    0.918015
207

A    0.021697
C    0.978303
G    0
T    0

I need to convert this to something that looks like:

Pos    A    C    G    T
148    0    0    0.081985    0.918015
207    0.021697    0.978303    0    0

So, my "Pos" entries are more or less already in a column. However, I need to convert the A, C, G, T rows to columns.

Any help would be appreciated.

6.4 years ago, seidel (United States) wrote:

This is a textbook example of one of the reasons perl was created. If your file is completely regular, you could write a few lines of perl to loop through the file and do something whenever it encounters a line starting with a number. For instance:

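print "Pos\tA\tC\tG\tT\n";                   # header line
while(<>){                                    # read one line at a time into $_
    chop;                                     # cut off the trailing newline
    if(/^\d+/){                               # line begins with one or more digits
        $number = $_; @values = ($number);    # start the list with the position
        <>;                                   # discard the empty line that follows
        for($i=0; $i<4;++$i){                 # the A, C, G, T lines
            $_ = <>; chop;
            ($base,$value) = split();
            push(@values, $value);
        }
        print join("\t", @values), "\n";
    }
}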

The code above would work to parse your little snippet, and format it the way you've shown above. But it assumes your file is structured in a completely regular way. If the code above were in a file called, and your data was in a file called foo.txt, you would call it like so:

./ foo.txt

and to dump the results to a new file:

./ foo.txt > newfile.txt

If you're unfamiliar with perl, here's what's happening. Print a header line like you have above, then loop through the file one line at a time. The <> symbols grab a line from the file and place it into a variable called $_. The chop function cuts off the last character of the line (the "newline"). The if statement tests whether the line begins with one or more digits (many functions like chop, split, and pattern matching operate on $_ implicitly unless another variable is handed to them explicitly). If the line begins with digits, remember the number and start a list of values. Grab the next line, which should be empty, and don't save it to anything (thus discarding it). Then set up a loop to process the next four lines: remove the end character, split each line by white space, and push the value onto the list that was created previously. After four lines, print the contents of the list, joined by tab characters and followed by a newline. Repeat until there are no more lines in the file!
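For readers who know neither perl nor awk, the same walkthrough can be sketched in Python. This is only an illustration, not the code from the answer: the function name reshape is made up here, and it assumes the layout the explanation describes (a position line, one empty line, then exactly four base lines).

```python
import re

def reshape(lines):
    """Turn 'position / empty line / A, C, G, T rows' blocks into tab-separated rows."""
    it = iter(lines)
    rows = ["Pos\tA\tC\tG\tT"]             # header line
    for line in it:
        line = line.strip()                # drop the newline
        if re.match(r"\d+", line):         # line begins with one or more digits
            values = [line]                # start the row with the position
            next(it, None)                 # discard the (assumed empty) next line
            for _ in range(4):             # the A, C, G, T lines
                base, value = next(it).split()
                values.append(value)
            rows.append("\t".join(values))
    return rows

# Assumed input layout, using the numbers from the question.
sample = """Pos/Line    p
148

A    0
C    0
G    0.081985
T    0.918015
207

A    0.021697
C    0.978303
G    0
T    0
""".splitlines(keepends=True)

print("\n".join(reshape(sample)))
```

Like the perl version, this breaks if the file is not completely regular: a block with fewer than four base lines raises an error instead of producing a short row.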

There are a variety of ways to solve your problem. An awk solution would also be easy to code. But with a few principles from perl that could be learned in an afternoon or two, you can reshape your file. (Some gurus might find the code above cringeworthy, but it gets the job done.)

6.4 years ago, wdiwdi wrote:

The Perl solution is overkill. This problem can be solved in a more readable fashion with a tiny awk script:

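  BEGIN    { print("Pos\tA\tC\tG\tT") }   # header, printed once before any input
  /^[0-9]/ { printf("%s\t", $1) }          # position line: start a new row
  /^[ACG]/ { printf("%s\t", $2) }          # A, C, G: value followed by a tab
  /^T/     { printf("%s\n", $2) }          # T: last value, end the row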

run as "gawk -f myscript.awk myinfile.txt >myoutfile.txt"
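To see why the awk script needs no loop or lookahead, here is a rough Python rendering of how its four rules are applied to every line independently. It is a sketch for illustration only: awk_like and field are hypothetical names, and the sample lines assume a position line followed by an empty line and four base lines.

```python
import re

def awk_like(lines):
    """Apply each rule to every line independently, the way awk does."""
    out = ["Pos\tA\tC\tG\tT\n"]               # BEGIN rule: print the header once
    for line in lines:
        fields = line.split()                  # awk's $1, $2, ...

        def field(n):                          # a missing field is "" in awk
            return fields[n - 1] if len(fields) >= n else ""

        if re.match(r"[0-9]", line):           # /^[0-9]/ : position, then a tab
            out.append(field(1) + "\t")
        if re.match(r"[ACG]", line):           # /^[ACG]/ : value, then a tab
            out.append(field(2) + "\t")
        if re.match(r"T", line):               # /^T/ : last value ends the row
            out.append(field(2) + "\n")
    return "".join(out)

# Assumed input layout, using the first block from the question.
sample_lines = ["Pos/Line    p", "148", "", "A    0", "C    0",
                "G    0.081985", "T    0.918015"]
print(awk_like(sample_lines), end="")
```

Lines that match no rule, such as the "Pos/Line" header and the empty lines, simply produce no output, which is why the script does not have to skip them explicitly.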


No solution is overkill or readable if one knows of no other solution or language. I think we can safely assume mphillips6789 knows neither awk nor perl (I did mention awk as a possibility in my response). For the edification of those who know neither, the notion of readability is interesting, and they shouldn't miss the common elements: using // to specify patterns to match by line, using {} to hold blocks of code, and putting things in variables starting with $.

Reply written 6.4 years ago by seidel

Thank you, problem solved. Looking at both solutions was educational in and of itself.

Reply written 6.4 years ago by mphillips6789