Use awk or other command to divide each column in a data file by its first number
twesteele • 6.1 years ago

I need to divide each column in a data text file by the first number in each column and then print the divided columns into a new file for the purposes of analyzing some calcium imaging data.

For example, if I have a file, data1.txt:

0.269279025887517   0.264783938118797   0.275486451749212  
0.270740157930877   0.253012512397955   0.280354009308003  
0.265142481116960   0.258302433813993   0.280007507438773  
0.265476939803159   0.261876223900715   0.278089570458534  
0.267494373235676   0.265344090943771   0.274431860837720  
0.275140192263676   0.261608300907912   0.275908445868620  
0.271029888609140   0.267924518705018   0.276844983596552  
0.269824902723735   0.267923628595407   0.272863416495003  
0.271370355535210   0.266164772004781   0.272643808651865

I need an awk script to give me a resulting file data2.txt:

1.000000000000000   1.000000000000000   1.000000000000000  
1.005426089308460   0.955543278778637   1.017668954418210  
0.984638444242275   0.975521535215266   1.016411172530820  
0.985880496738182   0.989018540026482   1.009449171430370  
0.993372478060781   1.002115509078660   0.996171895551321  
1.021766145197690   0.988006684871269   1.001531814420380  
1.006502038975570   1.011860918032010   1.004931392591950  
1.002027179184930   1.011857556386980   0.990478532655403  
1.007766403791760   1.005214945799940   0.989681368795825

The number of columns can be as many as 50, the number of rows also varies, and I need the output printed to 15 decimal places and tab-delimited. So far I have:

awk 'BEGIN{first_line=0;divide_by=1;}{if(first_line==0){first_line=1; divide_by=$2;}print $2/divide_by;}END{}' data1.txt > data2.txt

which only gives me the second column, and only to 6 decimal places. I need the script to do that for all of the columns in any given data file.

Context: I'm trying to analyze data from cells imaged with a fluorescent calcium dye and need to normalize the data columns by the average of their baseline (the first number).

Alex • 6.1 years ago

Use printf and %f format specifiers to add desired precision (within what awk allows by default). Use split to work with input containing an arbitrary number of columns:

$ awk '{ if (NR==1) { n = split($0, a, "\t"); } for (i=1; i<n; i++) { printf("%.15f\t", ($i/a[i])); } printf("%.15f\n", ($n/a[n])); }' in.mtx > out.mtx

Then:

$ more out.mtx
1.000000000000000   1.000000000000000   1.000000000000000
1.005426089308457   0.955543278778637   1.017668954418209
0.984638444242275   0.975521535215266   1.016411172530825
0.985880496738182   0.989018540026482   1.009449171430368
0.993372478060781   1.002115509078661   0.996171895551321
1.021766145197685   0.988006684871269   1.001531814420377
1.006502038975566   1.011860918032014   1.004931392591955
1.002027179184932   1.011857556386979   0.990478532655403
1.007766403791755   1.005214945799939   0.989681368795825

Another way to do this is with Python:

#!/usr/bin/env python

import sys

# The first row supplies the per-column denominators; every row
# (including the first) is then divided by it, element by element.
first_line = True
for line in sys.stdin:
    signal = line.rstrip().split('\t')
    if first_line:
        denoms = signal
        first_line = False
    rescaled = [float(a) / float(b) for a, b in zip(signal, denoms)]
    sys.stdout.write('\t'.join("%.15f" % x for x in rescaled) + "\n")

Then:

$ ./rescale_mtx.py < in.mtx > out.mtx

This, too, will scale to an arbitrary number of columns.

Finally, here's a version in Perl:

#!/usr/bin/env perl

use strict;
use warnings;

use constant false => 0;
use constant true  => 1;

# As in the Python version, the first row supplies the per-column denominators.
my $first_line = true;
my @denoms = ();
foreach my $line (<STDIN>) {
    chomp $line;
    my @signal = split '\t', $line;
    if ($first_line) {
        @denoms = @signal;
        $first_line = false;
    }
    my @rescaled = map {$signal[$_]/$denoms[$_]} 0..$#signal;
    foreach my $idx (0..$#rescaled) {
        printf('%.15f%c', $rescaled[$idx], ($idx < $#rescaled ? ord("\t") : ord("\n")));
    }
}

This works much like the Python script:

$ ./rescale_mtx.pl < in.mtx > out.mtx

As with the Python and awk options, this will scale to an arbitrary number of columns.

Comment:

This does seem to be what I'm after, but it only gives me the first column.

Reply:

Check that your input is actually tab-delimited, since split here expects tab-delimited input.

One way to check is with cat -te; for example:

$ cat -te in.mtx
0.269279025887517^I0.264783938118797^I0.275486451749212$
0.270740157930877^I0.253012512397955^I0.280354009308003$
0.265142481116960^I0.258302433813993^I0.280007507438773$
0.265476939803159^I0.261876223900715^I0.278089570458534$
0.267494373235676^I0.265344090943771^I0.274431860837720$
0.275140192263676^I0.261608300907912^I0.275908445868620$
0.271029888609140^I0.267924518705018^I0.276844983596552$
0.269824902723735^I0.267923628595407^I0.272863416495003$
0.271370355535210^I0.266164772004781^I0.272643808651865$

Tabs should be shown between numbers as ^I and lines should be terminated with $ characters.

If that's not the case for you, then you need to clean or reformat your input first, with tr or sed or awk, etc.
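For instance, a couple of sketches, assuming the fields are separated by runs of spaces as in the question's sample (the output filename is arbitrary):

$ tr -s ' ' '\t' < in.mtx > in.tabs.mtx

or, letting awk rebuild each record with a tab as the output field separator:

$ awk -v OFS='\t' '{ $1 = $1; print }' in.mtx > in.tabs.mtx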

If you're using OS X, also make sure you're using GNU awk. One way to do this is to install Homebrew and then run brew install gawk. Replace use of awk with gawk to get an authentically GNUish awk experience.

If you're using OS X and want to clean the input file, note that the BSD version of sed that ships with OS X behaves differently from the GNU sed shipped with Linux. I have a blog post about replacing runs of multiple spaces with single tabs using BSD sed.

Reply:

This is really great, thank you, Alex. Still struggling to get the columns tab-delimited, but I'm coming down the home stretch.

Reply:

Cleaning input seems to be 95% of bioinformatics. Good luck and let us know how it goes.

zx8754 • 6.1 years ago

Using R:

# Use below to read from file
# df1 <- read.table("myInput.txt")

# Using an example input as text:
df1 <- read.table(text = "
0.269279025887517   0.264783938118797   0.275486451749212  
0.270740157930877   0.253012512397955   0.280354009308003  
0.265142481116960   0.258302433813993   0.280007507438773  
0.265476939803159   0.261876223900715   0.278089570458534  
0.267494373235676   0.265344090943771   0.274431860837720  
0.275140192263676   0.261608300907912   0.275908445868620  
0.271029888609140   0.267924518705018   0.276844983596552  
0.269824902723735   0.267923628595407   0.272863416495003  
0.271370355535210   0.266164772004781   0.272643808651865")

output <- mapply("/", df1, df1[1, ])

# check output
output
#             V1        V2        V3
# [1,] 1.0000000 1.0000000 1.0000000
# [2,] 1.0054261 0.9555433 1.0176690
# [3,] 0.9846384 0.9755215 1.0164112
# [4,] 0.9858805 0.9890185 1.0094492
# [5,] 0.9933725 1.0021155 0.9961719
# [6,] 1.0217661 0.9880067 1.0015318
# [7,] 1.0065020 1.0118609 1.0049314
# [8,] 1.0020272 1.0118576 0.9904785
# [9,] 1.0077664 1.0052149 0.9896814

# Note: Decimals are not truncated, this is just a default print in R.
#       Output file should have all the decimals

# write to a file
write.table(output, file = "myOutput.txt", col.names = FALSE, row.names = FALSE)
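The question also asked for tab-delimited output with 15 decimal places; one way to get that (a sketch using base R's formatC, which keeps the matrix shape and fixes the number of decimals; quote = FALSE stops write.table quoting the now-character values):

# write tab-delimited output with exactly 15 decimal places
write.table(formatC(output, format = "f", digits = 15), file = "myOutput.txt",
            sep = "\t", quote = FALSE, col.names = FALSE, row.names = FALSE)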
6.1 years ago

Putting my own solution in AWK back up here, purely for my own future reference (when I search for some code that I need!).

This works with any number of columns and they can be either space- or tab-delimited:

awk '{if (NR==1) {for (i=1; i<=NF; i++) denominator[i]=$(i)} for (i=1; i<=NF; i++) if (NF==i) {printf $(i)/denominator[i]"\n"} else {printf $(i)/denominator[i]"\t"}}' FS=' ' data1.txt
1           1           1
1.00543     0.955543    1.01767
0.984638    0.975522    1.01641
0.98588     0.989019    1.00945
0.993372    1.00212     0.996172
1.02177     0.988007    1.00153
1.0065      1.01186     1.00493
1.00203     1.01186     0.990479
1.00777     1.00521     0.989681
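A caveat for future readers: this one-liner passes the computed value to printf as the format string, which happens to work here but is fragile. A minimal, more idiomatic variant that also prints the 15 decimal places requested in the question (a sketch; any POSIX awk should accept it):

awk '{ if (NR == 1) for (i = 1; i <= NF; i++) d[i] = $i;   # first row = per-column denominators
       for (i = 1; i <= NF; i++)
           printf "%.15f%s", $i / d[i], (i < NF ? "\t" : "\n")
     }' data1.txt > data2.txt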