Question

calculate the average of column 7 if the rows of column 2 have the same value

0

Entering edit mode

4.4 years ago

Hann ▴ 110

Hello,

I am trying to use awk command to calculate the average of the values in column 7, if the rows of column 2 have the same value

example data:

#CHROM  wind_start      wind_end        CHROM   POS     POS     relative_t_o    relative_o_t
Dexi_CM05836_chr09A     1       10000   Dexi_CM05836_chr09A     1250    1250    1       1
Dexi_CM05836_chr09A     1       10000   Dexi_CM05836_chr09A     2215    2215    1       1
Dexi_CM05836_chr09A     1       10000   Dexi_CM05836_chr09A     2278    2278    1       1
Dexi_CM05836_chr09A     10001   20000   Dexi_CM05836_chr09A     10452   10452   1       1.095238095
Dexi_CM05836_chr09A     40001   50000   Dexi_CM05836_chr09A     46251   46251   1       1.047619048
Dexi_CM05836_chr09A     40001   50000   Dexi_CM05836_chr09A     41892   41892   1       1
Dexi_CM05836_chr09A     110001  120000  Dexi_CM05836_chr09A     109479  109479  1       0.673992674
Dexi_CM05836_chr09A     140001  150000  Dexi_CM05836_chr09A     141093  141093  0.913043478     0.727272727
Dexi_CM05836_chr09A     140001  150000  Dexi_CM05836_chr09A     141446  141446  1       1

so if the column wind_stat has value 1 in the rows then it will calculate the average of column 7 and so on:

wind_start average_relative_t_o
1   1
10001   1
40001   1
110001  1
140001  0.955

result will be I started with this:

awk '{b[$2] total +=$7}END {print total/NR}} END { for (i in b) { print b[i],i } } '  file

I know that this will calculate the average:

awk '{ total += $7 } END { print total/NR }'  file

Any help would be appreciated

bash shell • 2.9k views

ADD COMMENT • link 4.4 years ago by Hann ▴ 110

2

Entering edit mode

with datamash:

$ datamash  -sH -g 2 mean 7 < test.txt | sort -k 1n
GroupBy(wind_start) mean(relative_t_o)
1   1
10001   1
40001   1
110001  1
140001  0.956521739

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

it's a nice way to do it but I don't know why for some reason, it jumps after 30 kb

example:

1   0.9776751
10001   1.05638587
20001   1.066900363
30001   1.066353926  << from here it jumps to 100001(70 kb instead of 10kb)
100001  0.995859213
110001  1.026932854
120001  1.108739247
130001  1.029039356
140001  1.009835149
150001  1.047956924
160001  1.015176368
170001  1.012892992
180001  0.9912546
......
.....
....
270001  0.973239006
280001  0.984189723
290001  1.069620114
300001  1.1592556  << same here
1000001 0.982608696 ..... and so on

ADD REPLY • link 4.4 years ago by Hann ▴ 110

0

Entering edit mode

I am not sure what is going on. Probably it is not sorted proper. Could you please post a few records at the place record jumping (for eg 10 records that side and 10 records this side with identical 2nd column values)? @ haneenih7

if sorting is the issue, try this with tsv-utils:

$ datamash  -sH -g 2 mean 7 < test.txt | keep-header -- sort -k1,1n

ADD REPLY • link 4.4 years ago by cpad0112 21k

score 0 · Answer 1 · 2019-11-11

0

Entering edit mode

4.4 years ago

Hann ▴ 110

It worked with this awk command line:

awk '{date1[$2]+=$7;++date2[$2]}END{for (key in date1) print key  , date1[key]/date2[key]}'  file.txt > test.txt

ADD COMMENT • link 4.4 years ago by Hann ▴ 110