diFST with awk - subtract and divide column value by their average and stdev
0
0
Entering edit mode
4.8 years ago

Hi everyone,

I am trying to create a simple script to sum z-scores of several pairwise FST comparisons (basically, replicate the diFST statistic from this paper: http://www.pnas.org/content/107/3/1160.full)

I already found awk scripts to calculate average and standard deviation of a field, but I was looking for something more specific: integrate these values within the script itself. It would be something such as this:

DATASET
0.19393939  0.00000000  0.00000000
0.00000000  0.00000000  0.07407407
0.13658455  0.08052603  0.06750392
0.03704075  0.09428103  0.05263158
0.06046107  0.02883006  0.05263158
0.07591252  0.03383722  0.01593773
0.10475719  0.02040816  0.09956710
0.05010618  0.12556459  0.11561691
0.09554731  0.24450549  0.17279412
0.04499541  0.04938272  0.03703704

MOCK-UP SCRIPT
awk '{print ($1-(average$1))/(stdev$1) + ($2-(average$2))/(stdev$2) + ($3-(average$3))/(stdev$3)}' OUTPUT -0.2179645555 -2.2434375436 1.1659017773 -0.7342745412 -1.1920991459 -1.5720200178 0.419356693 1.1658334403 4.7111786566 -1.5024747631  If you could help with just the awk '{print ($1-(average$1))/(stdev$1)}' the rest I can easily do. Thank you for the help!

awk unix average standard deviation z-score • 2.5k views
0
Entering edit mode

0
Entering edit mode

Hi Pierre, the reason I asked for awk is because I'm used to doing most simple calculations in this language.

0
Entering edit mode

R is good for handling row-oriented data, it's not the tool of choice for columns/whole-file.

0
Entering edit mode

This outputs the mean, variance, and sd. I'm not sure what you are doing after that (?)

awk '{sum=0; meandiff=0; variance=0; sd=0; z=0; for(i=1; i<=NF; i++) {sum+=$i; sumSq+=($i*$i)}; mean=sum/NF; for(i=1; i<=NF; i++) {meandiff+=(($i-mean)**2)}; variance=(meandiff/(NF-1)); sd=sqrt(variance); print mean"\t"variance"\t"sd}' DATASET

0
Entering edit mode

Hi Kevin,

My objective is to incorporate the mean and standard deviation of the column into the formula itself. For example, given a column $1, do: ($1-average$1)/(stdev$1)

Basically, I want to convert each value of \$1 into a zscore (for example, following this: https://statistics.laerd.com/statistical-guides/standard-score-2.php). I know how to do this in several steps, first calculating the mean and standard deviation separately for the column and then incorporate these values into awk, but I imagine there must be a way to do this in a single step.

1
Entering edit mode

I see, I wrote my awk code above for summarising rows (not columns).

It may be complex in awk as you'd have to read over the file multiple times. The syntax in awk or that is: awk 'NR=FNR {first pass actions; next} {second pass actions}' DATASET DATASET

As Pierre said, what about R?

Here's a R script that should be executable in your shell that will convert your data to Z scores:

#!/usr/bin/env Rscript
write.table(result, "DATASET.Z.tsv", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
result
V1         V2          V3
[1,]  1.1547005 -0.5773503 -0.57735027
[2,] -0.5773503 -0.5773503  1.15470054
[3,]  1.1363896 -0.3908140 -0.74557564
[4,] -0.8203855  1.1139155 -0.29352998
[5,]  0.7984454 -1.1216241  0.32317870
[6,]  1.1048403 -0.2617382 -0.84310206
[7,]  0.6313757 -1.1529593  0.52158360
[8,] -1.1461710  0.6944072  0.45176383
[9,] -1.0121541  0.9873858  0.02476832
[10,]  0.1902031  0.8912387 -1.08144180

1
Entering edit mode

Hi Kevin,

Thanks for the script! That does exactly what I needed!