Question: How to deal with data set harmonization in R
2
gravatar for Lila M
3 months ago by
Lila M 800
UK
Lila M 800 wrote:

Hi everybody, I have a huge data set (from different locations) that I have to harmonized and do some data cleaning, as an example below:

  ID loc1_age proce_age loc1_dof proce_dof loc1_bmi proce_bmi  site4_wight pl3_length
   1    20     NA        1999       NA      54        NA         NA         NA
   2    15     NA        2000       NA      46        NA         NA         NA
   3    NA     9         NA         1998    NA        24         NA         NA
   4    NA     15        NA         2002    NA        36         NA         NA  
 ...
 505   NA     NA          NA         NA     NA       NA           78         NA         
 506   NA     NA          NA         NA     NA       NA           94         NA
 ...
1956   NA     NA         NA          NA    NA        NA          NA          176
1957   NA     NA         NA          NA    NA        NA          NA          189

final data:

       ID      age      dof    bmi  site4_wight pl3_length
       1        20      1999    54       NA        NA         
       2        15      2000    46       NA        NA
       3        9       1998    24       NA        NA
       4        15      2002    36       NA        NA
       ...
       505      NA     NA       NA      78        NA         
       506      NA     NA       NA      94        NA
       ...
       1956    NA     NA         NA      NA       176
       1957    NA     NA         NA      NA       189

For doing this, I'm using R, in more detail df %>% mutate(age = coalesce(loc1_age, proce_age))
My concern is, with a few of variables it works manually, but I have like around 41 variables to coalesce

#get common variables for two subset
sub_loc= sub("^loc_", "", loc1)
sub_proce= sub("^proce", "", proce)
ii = intersect(sub_loc, sub_proce) # this give to me 41 variables in common
length(ii)

what I've tried so far to implement a loop in R and create the new variables is this:

for (i in ii) {
df_populated <- df %>% 
 mutate(paste("loc", i, sep = "") = coalesce(paste("loc", i, sep = ""), paste("proce", i, sep = "")
}

But it doesn't work at all. Any ideas to do life easier with data cleaning? Thanks

ADD COMMENTlink modified 3 months ago by Friederike5.2k • written 3 months ago by Lila M 800
2

Is it correct to assume that every ID only has one value, i.e. if there's an entry in one of the location columns, all the other ones will have NA?

Then this could work:

> library(data.table)
> dt <- data.table(ID = c(1:5), loc1 = c(8, 5, rep(NA, 3)), loc2 = c(rep(NA, 2), 6, 9, 5))
> dt.melt <- melt.data.table(dt, id.vars = "ID")
> dt.melt
    ID variable value
 1:  1     loc1     8
 2:  2     loc1     5
 3:  3     loc1    NA
 4:  4     loc1    NA
 5:  5     loc1    NA
 6:  1     loc2    NA
 7:  2     loc2    NA
 8:  3     loc2     6
 9:  4     loc2     9
10:  5     loc2     5

# remove all entries with an NA
> dt.melt[ !is.na(value)]

or without data.table:

> df <- data.frame(ID = c(1:5), loc1 = c(8, 5, rep(NA, 3)), loc2 = c(rep(NA, 2), 6, 9, 5))
> df.melt <- reshape2::melt(df, id.vars = "ID")
> df.melt[!is.na(df$value),]
ADD REPLYlink modified 3 months ago • written 3 months ago by Friederike5.2k
1

Thank for the answer, but this is not what I asked. As I know how to merge the variables (I explained it in my post), I do not know how to do with a list of variables in one step.

ADD REPLYlink written 3 months ago by Lila M 800
1

The last line of the code above should be:

df.melt[!is.na(df.melt$value),]

then if produces the output you wrote you wanted from the example data you gave. If this is not what you want, then please clarify.

EDIT: I now understand you want to process multiple variables in the same way, eventually in one step. Repeating the given code for each variable in a loop should work.

ADD REPLYlink modified 3 months ago • written 3 months ago by Jean-Karim Heriche21k

But this do it only for one variable. I need to do it for a list of variables

ADD REPLYlink written 3 months ago by Lila M 800
1

I don't understand. The melt command will work irrespective of the number of additional columns beyond ID.

library(data.table)
> dt
   ID loc1 loc2 loc3
1:  1    8   NA   NA
2:  2    5   NA   NA
3:  3   NA    6   NA
4:  4   NA    9   NA
5:  5   NA    5   NA
6:  6   NA   NA    1
7:  7   NA   NA    2
8:  8   NA   NA    3

> dt.melt <- melt.data.table(dt, id.vars = "ID")
> dt.melt
    ID variable value
 1:  1     loc1     8
 2:  2     loc1     5
 3:  3     loc1    NA
 4:  4     loc1    NA
 5:  5     loc1    NA
 6:  6     loc1    NA
 7:  7     loc1    NA
 8:  8     loc1    NA
 9:  1     loc2    NA
10:  2     loc2    NA
11:  3     loc2     6
12:  4     loc2     9
13:  5     loc2     5
14:  6     loc2    NA
15:  7     loc2    NA
16:  8     loc2    NA
17:  1     loc3    NA
18:  2     loc3    NA
19:  3     loc3    NA
20:  4     loc3    NA
21:  5     loc3    NA
22:  6     loc3     1
23:  7     loc3     2
24:  8     loc3     3

> dt.melt[!is.na(value)]
   ID variable value
1:  1     loc1     8
2:  2     loc1     5
3:  3     loc2     6
4:  4     loc2     9
5:  5     loc2     5
6:  6     loc3     1
7:  7     loc3     2
8:  8     loc3     3
ADD REPLYlink modified 3 months ago • written 3 months ago by Friederike5.2k
3
gravatar for zx8754
3 months ago by
zx87548.2k
London
zx87548.2k wrote:

Edit:

To get coalesce over multiple variables, try below:

# example data
df <- read.table(text = "
ID loc1_age proce_age loc1_dof proce_dof loc1_bmi proce_bmi site4_wight pl3_length
1       20        NA     1999        NA       54        NA          NA         NA
2       15        NA     2000        NA       46        NA          NA         NA
3       NA         9       NA      1998       NA        24          NA         NA
4       NA        15       NA      2002       NA        36          NA         NA
505       NA        NA       NA        NA       NA        NA          78         NA
506       NA        NA       NA        NA       NA        NA          94         NA
1956       NA        NA       NA        NA       NA        NA          NA        176
1957       NA        NA       NA        NA       NA        NA          NA        189", header = TRUE)


library(dplyr)
library(tidyselect)

df %>% 
  transmute(ID, 
            age = coalesce(!!!syms(vars_select(names(df), ends_with("age")))),
            dof = coalesce(!!!syms(vars_select(names(df), ends_with("dof")))),
            bmi = coalesce(!!!syms(vars_select(names(df), ends_with("bmi")))),
            site4_wight, 
            pl3_length
  )

#     ID age  dof bmi site4_wight pl3_length
# 1    1  20 1999  54          NA         NA
# 2    2  15 2000  46          NA         NA
# 3    3   9 1998  24          NA         NA
# 4    4  15 2002  36          NA         NA
# 5  505  NA   NA  NA          78         NA
# 6  506  NA   NA  NA          94         NA
# 7 1956  NA   NA  NA          NA        176
# 8 1957  NA   NA  NA          NA        189

Original:

Pass columns that match certain pattern as string to coalesce:

# example data
df <- read.table(text = " ID location1_varx location2_varx  
1 8 NA                                              
2 5 NA            
3 NA 6             
4 NA 9              
5 NA 5", header = TRUE)


library(dplyr)
library(tidyselect)

df %>% 
  mutate(varx = coalesce(!!!syms(vars_select(names(df), 
                                             starts_with("location")))))
#   ID location1_varx location2_varx varx
# 1  1              8             NA    8
# 2  2              5             NA    5
# 3  3             NA              6    6
# 4  4             NA              9    9
# 5  5             NA              5    5
ADD COMMENTlink modified 3 months ago • written 3 months ago by zx87548.2k

But is not any way to do in a loop as I suggest in my post? The data set is a bit messy to use the method that you proposed.

ADD REPLYlink written 3 months ago by Lila M 800
2

R-programmers typically try to avoid for-loops.

Note how this solution only requires there to be "location" at the beginning of the column name. You can replace it with any pattern that may be appropriate.

You may need to share more details about why you think your data is messy.

ADD REPLYlink modified 3 months ago • written 3 months ago by Friederike5.2k
1

Just give an example of real data then otherwise, we're just guessing about what your real problem is.

ADD REPLYlink written 3 months ago by Jean-Karim Heriche21k

The thing is that I have around 2,000 variables, codified in different ways (sometimes the name of the variable match, sometimes not). When I get the match for the common variables in different locations (I have around 5 different locations, and not always starts with "location"). So for two location (example gave before) I get a total of 41 variables in common (stored in ii as reported before), I want to coalesce all of those variables for the two different location in another one, but I can't do it using starts_with("location")) because not always the location strats with it, and neither with melt, because it mixes the data set. Not sure if is more clear now.

ADD REPLYlink written 3 months ago by Lila M 800

maybe it would make sense to not try to do it all at once?

but I still don't fully understand your data set, so there's that :)

EDIT: If you're not willing to share your data set or at least its original structure (which is fine), then the ideas we've given you here may need to suffice to get you thinking about how to tweak if for your specific data set. Otherwise supply a more realistic toy example.

ADD REPLYlink modified 3 months ago • written 3 months ago by Friederike5.2k

Yes, It is original data and I can't do it. Not to do it all at once is not an option ;) For this intersection I have a list of 41 variables, but for other it could be 130 variables and so on. So this is way I ask for some solution in my for loop :)

ADD REPLYlink written 3 months ago by Lila M 800

Provide the representative example data.

The idea would be the same, we need to treat strings as variables. In your for loop, var we are trying to create is a string and columns we are passing to coalesce is string.

ADD REPLYlink written 3 months ago by zx87548.2k

It is confidential data and I can't share. This is way I ask for a solution in my code. Doing it for each variable once at time, works perfectly, but I would like to know how to introduce it in a loop in order to do it automatically for a list of variables.

ADD REPLYlink written 3 months ago by Lila M 800

No need for real data, representative dummy example would be enough, so we could try to adapt your forloop solution.

ADD REPLYlink written 3 months ago by zx87548.2k

edited my previous question. This may give you and idea of my data

ADD REPLYlink written 3 months ago by Lila M 800
1

Updated my post, see edit.

ADD REPLYlink written 3 months ago by zx87548.2k
3
gravatar for Friederike
3 months ago by
Friederike5.2k
United States
Friederike5.2k wrote:

I'm sure there's a more elegant solution with the tidyverse, but sticking to my original data.table-based proposal, one solution could look like this:

library(data.table)
library(magrittr)

> tt <- fread("tmp.txt", header = TRUE)

> tt
     ID loc1_age proce_age loc1_dof proce_dof loc1_bmi proce_bmi site4_wight pl3_length
1:    1       20        NA     1999        NA       54        NA          NA         NA
2:    2       15        NA     2000        NA       46        NA          NA         NA
3:    3       NA         9       NA      1998       NA        24          NA         NA
4:    4       NA        15       NA      2002       NA        36          NA         NA
5:  505       NA        NA       NA        NA       NA        NA          78         NA
6:  506       NA        NA       NA        NA       NA        NA          94         NA
7: 1956       NA        NA       NA        NA       NA        NA          NA        176
8: 1957       NA        NA       NA        NA       NA        NA          NA        189

# just reminding you of the basic pipe of functions I will string together per set of variables
> tt[, c("ID", grep("age", names(tt), value=TRUE)), with=FALSE] %>%  # extracing the ID column and the columns that are to be combined based on their shared property of having "age" in their label
        melt.data.table(., id.vars = "ID") %>% .[!is.na(value)]
   ID  variable value
1:  1  loc1_age    20
2:  2  loc1_age    15
3:  3 proce_age     9
4:  4 proce_age    15

# let's do the for-loop using the names that you showed you wanted to end up
> lapply(c("age","dof","bmi","site4_wight","pl3_length"), function(x){
  out <- tt[, c("ID", grep(x, names(tt), value=TRUE)), with=FALSE] %>% melt.data.table(., id.vars = "ID") %>% .[!is.na(value)]
  out$variable <- x
  return(out)
  }) %>% rbindlist %>% dcast.data.table(., ID ~ variable)
     ID age  dof bmi site4_wight pl3_length
1:    1  20 1999  54          NA         NA
2:    2  15 2000  46          NA         NA
3:    3   9 1998  24          NA         NA
4:    4  15 2002  36          NA         NA
5:  505  NA   NA  NA          78         NA
6:  506  NA   NA  NA          94         NA
7: 1956  NA   NA  NA          NA        176
8: 1957  NA   NA  NA          NA        189
ADD COMMENTlink modified 3 months ago • written 3 months ago by Friederike5.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 749 users visited in the last hour