Calculating average using information from three different columns of a file.
2
0
Entering edit mode
5.0 years ago
bk11 ★ 1.2k

Hi

I have several thousand lines sheet with columns like this:

Count   Cell_Types  FPKM    transcript_name
1   alternatively activated macrophage  0.02    AP002954.3
2   alternatively activated macrophage  0   AP002954.3
3   alternatively activated macrophage  0.02    AP002954.3
4   alternatively activated macrophage  0.03    AP002954.3
5   alternatively activated macrophage  0.01    AP002954.3
6   alternatively activated macrophage  0.04    AP002954.3
7   alternatively activated macrophage  0   AP002954.3
272 alternatively activated macrophage  62.54   ARCN1
273 alternatively activated macrophage  63.81   ARCN1
274 alternatively activated macrophage  59.45   ARCN1
275 alternatively activated macrophage  60.11   ARCN1
276 alternatively activated macrophage  65.07   ARCN1
277 alternatively activated macrophage  61.61   ARCN1
278 alternatively activated macrophage  46.62   ARCN1
542 alternatively activated macrophage  199.45  ATP5L
543 alternatively activated macrophage  212.03  ATP5L
544 alternatively activated macrophage  195.81  ATP5L
545 alternatively activated macrophage  266.32  ATP5L
546 alternatively activated macrophage  218.48  ATP5L
547 alternatively activated macrophage  216 ATP5L
548 alternatively activated macrophage  227.52  ATP5L


I want to take average of FPKM for individual transcript_names from individual Cell_types and get result like this:

Count   Cell_Types  Average_FPKM    transcript_name
1   alternatively activated macrophage  0.017   AP002954.3
2   alternatively activated macrophage  59.887  ARCN1
3   alternatively activated macrophage  219.372 ATP5L


I will appreciate any help. Thanks.

Unix R Python • 858 views
0
Entering edit mode

with datamash:

$datamash -Hsg 2,4 mean 3 < test.txt GroupBy(Cell_Types) GroupBy(transcript_name) mean(FPKM) alternatively activated macrophage AP002954.3 0.017142857142857 alternatively activated macrophage ARCN1 59.887142857143 alternatively activated macrophage ATP5L 219.37285714286  with tsv-utils: $ tsv-summarize --header --group-by 2,4 --mean 3 test.txt  | nl -v 0 -w 1 | sed '1s/0/S.No/'

S.No    Cell_Types  transcript_name FPKM_mean
1   alternatively activated macrophage  AP002954.3  0.0171428571429
2   alternatively activated macrophage  ARCN1   59.8871428571
3   alternatively activated macrophage  ATP5L   219.372857143

1
Entering edit mode
5.0 years ago

Hello,

this can be done with datamash for grouping your data and awkfor formating the output:

$datamash -H -f -R 3 -g4 mean 3 < input.txt|awk -v FS="\t" -v OFS="\t" 'NR==1 {print$1,$2,"Average_FPKM",$4; next} {print NR-1,$2,$5,\$4}'


fin swimmer

0
Entering edit mode
5.0 years ago
zx8754 11k

Using R, aggregate:

aggregate(.~ Cell_Types + transcript_name, df1[ , -1], mean)

#                           Cell_Types transcript_name         FPKM
# 1 alternatively activated macrophage      AP002954.3   0.01714286
# 2 alternatively activated macrophage           ARCN1  59.88714286
# 3 alternatively activated macrophage           ATP5L 219.37285714


See SO for other alternative solutions: Mean per group in a data.frame