Question: Standardise values in R data frame
1
paolo002160 wrote:

Hi all

I have a data frame with various SNPs and other columns with some information and values such as the following:

``````SNP ID.      location    chromosome    values column A.        values column B
rs8662689     78654         1             0.6432                    0.2458
rs753279     1009753        7            -1.6434                    1.9876
rs4331780     2086433       22            4.521                    -3.743
and so on......                 .....                 ......
``````

I would like to standardise the values in column B. I understand I have to divide the column B in various frequency bins ( I have 20x10^6 rows so I guess I need to set a bin value by which to divide, for instance 1000). Then I would like to calculate the mean of each bin and the standard deviation and divide each value in column B by that particular mean and standard deviation calculated previously in each bin and create a new column. Anybody knows how to do this by a R code?

I have written something like this but it does not seem to work, might be wrong:

``````library(dplyr)
n_bins = 1000
outscore = df %>% mutate(bin=ntile(mean(df\$valuesB),n_bins)) %>%
group_by(bin) %>% mutate(zscore=scale(mean()),outlier=abs(zscore)>1.7)
``````

Any help highly appreciated. Thanks

R • 494 views
modified 2.2 years ago • written 2.2 years ago by paolo002160

for z score calculation, R has scale function.

Yes, I know, but if I apply only scale to the column it calculates the scores for each value of column B but it is not dividing by frequency bins

Not clear what you want to do, maybe provide example input data, and expected output? Also, this `ntile(mean(df\$valuesB),n_bins)` creates only 1 bin, and this `scale(mean())` meant to be `scale(mean(valuesB))` ?

Apologies for the not so clear explanation Here is a sample of my data, first 20 rows:

``````      Location  iHH_A1  iHH_B1  iHH_P1    XPEHH
1  rs117712813 16204.6 12744.9 13735.6 0.240170
2  rs186953131 16276.3 13073.5 13948.8 0.219126
3    10:127358 16276.3 13073.5 13948.8 0.219126
4    10:127758 16276.3 13073.5 13948.8 0.219126
5   rs78617290 16276.3 12984.2 13895.4 0.225974
6  rs111769723 16276.3 13006.1 13907.8 0.224292
7    rs9419541 15830.9 12618.6 13501.5 0.226793
8    rs2928098 14876.2 11812.6 12623.9 0.230596
9  rs113714579 15722.9 12533.7 13405.7 0.226698
10   rs2928100 15398.9 12197.0 13079.0 0.233101
11   10:128988 16814.9 13470.8 14409.6 0.221744
12   10:129015 16814.9 13470.8 14409.6 0.221744
13   rs9419543 16771.4 13330.5 14294.4 0.229620
14 rs117825552 16898.4 13300.9 14311.2 0.239391
15 rs111736399 16970.0 13577.1 14505.8 0.223062
16   10:129400 16970.0 13494.7 14448.5 0.229148
17 rs144184182 16970.0 13529.7 14467.8 0.226555
18 rs189429487 16970.0 13529.7 14467.8 0.226555
19 rs111738234 16970.0 13529.7 14467.8 0.226555
20 rs113712674 16970.0 13529.7 14467.8 0.226555
``````

I would like to standardise the values of the XPEHH column. I guess from what I understand I need to divide the values in bins and calculate the z-scores on the separate bins. So in this example for instance if I group by 10 bins I calculate the mean of the first 10 values and their standard deviation and calculate the z-score of those 10 SNPs based on the mean and SD previously calculated. The following is again done on the following 10 values/SNPs. Hope that is a bit clearer. Yes, I meant scale(mean(valuesB)) If I do only df\$zscores<-scale(df\$XPEHH) it calculates the zscore globally, I want to calculates the zscores on the separate bins.

Thank you very much, I think I can see this is most likely what I need because in the bin column it is grouping SNPs with similar values and then it calculates the zscore. Thanks a lot.

1
zx87549.9k wrote:

Try this exampe, see if that is what you need:

``````library(dplyr)

n_bins = 2

#example data
df <- read.table(text = "      Location  iHH_A1  iHH_B1  iHH_P1    XPEHH
1  rs117712813 16204.6 12744.9 13735.6 0.240170
2  rs186953131 16276.3 13073.5 13948.8 0.219126
3    10:127358 16276.3 13073.5 13948.8 0.219126
4    10:127758 16276.3 13073.5 13948.8 0.219126
5   rs78617290 16276.3 12984.2 13895.4 0.225974
6  rs111769723 16276.3 13006.1 13907.8 0.224292
7    rs9419541 15830.9 12618.6 13501.5 0.226793
8    rs2928098 14876.2 11812.6 12623.9 0.230596
9  rs113714579 15722.9 12533.7 13405.7 0.226698
10   rs2928100 15398.9 12197.0 13079.0 0.233101
11   10:128988 16814.9 13470.8 14409.6 0.221744
12   10:129015 16814.9 13470.8 14409.6 0.221744
13   rs9419543 16771.4 13330.5 14294.4 0.229620
14 rs117825552 16898.4 13300.9 14311.2 0.239391
15 rs111736399 16970.0 13577.1 14505.8 0.223062
16   10:129400 16970.0 13494.7 14448.5 0.229148
17 rs144184182 16970.0 13529.7 14467.8 0.226555
18 rs189429487 16970.0 13529.7 14467.8 0.226555
19 rs111738234 16970.0 13529.7 14467.8 0.226555
20 rs113712674 16970.0 13529.7 14467.8 0.226555", header = TRUE)

res <- df %>%
mutate(bin = ntile(XPEHH, n_bins)) %>%
group_by(bin) %>%
mutate(zscore = scale(XPEHH),
outlier = abs(zscore) > 1.7) %>%
ungroup()

res
# # A tibble: 20 x 8
#    Location    iHH_A1 iHH_B1 iHH_P1 XPEHH   bin  zscore outlier
#    <fct>        <dbl>  <dbl>  <dbl> <dbl> <int>   <dbl> <lgl>
#  1 rs117712813 16205. 12745. 13736. 0.240     2  1.80   TRUE
#  2 rs186953131 16276. 13074. 13949. 0.219     1 -1.19   FALSE
#  3 10:127358   16276. 13074. 13949. 0.219     1 -1.19   FALSE
#  4 10:127758   16276. 13074. 13949. 0.219     1 -1.19   FALSE
#  5 rs78617290  16276. 12984. 13895. 0.226     1  1.07   FALSE
#  6 rs111769723 16276. 13006. 13908. 0.224     1  0.513  FALSE
#  7 rs9419541   15831. 12619. 13502. 0.227     2 -0.788  FALSE
#  8 rs2928098   14876. 11813. 12624. 0.231     2 -0.0517 FALSE
#  9 rs113714579 15723. 12534. 13406. 0.227     2 -0.807  FALSE
# 10 rs2928100   15399. 12197  13079  0.233     2  0.433  FALSE
# 11 10:128988   16815. 13471. 14410. 0.222     1 -0.324  FALSE
# 12 10:129015   16815. 13471. 14410. 0.222     1 -0.324  FALSE
# 13 rs9419543   16771. 13330. 14294. 0.230     2 -0.241  FALSE
# 14 rs117825552 16898. 13301. 14311. 0.239     2  1.65   FALSE
# 15 rs111736399 16970  13577. 14506. 0.223     1  0.109  FALSE
# 16 10:129400   16970  13495. 14448. 0.229     2 -0.332  FALSE
# 17 rs144184182 16970  13530. 14468. 0.227     1  1.26   FALSE
# 18 rs189429487 16970  13530. 14468. 0.227     1  1.26   FALSE
# 19 rs111738234 16970  13530. 14468. 0.227     2 -0.834  FALSE
# 20 rs113712674 16970  13530. 14468. 0.227     2 -0.834  FALSE
``````