Hi everybody, I have a huge data set (from different locations) that I have to harmonized and do some data cleaning, as an example below:
  ID loc1_age proce_age loc1_dof proce_dof loc1_bmi proce_bmi  site4_wight pl3_length
   1    20     NA        1999       NA      54        NA         NA         NA
   2    15     NA        2000       NA      46        NA         NA         NA
   3    NA     9         NA         1998    NA        24         NA         NA
   4    NA     15        NA         2002    NA        36         NA         NA  
 ...
 505   NA     NA          NA         NA     NA       NA           78         NA         
 506   NA     NA          NA         NA     NA       NA           94         NA
 ...
1956   NA     NA         NA          NA    NA        NA          NA          176
1957   NA     NA         NA          NA    NA        NA          NA          189
final data:
       ID      age      dof    bmi  site4_wight pl3_length
       1        20      1999    54       NA        NA         
       2        15      2000    46       NA        NA
       3        9       1998    24       NA        NA
       4        15      2002    36       NA        NA
       ...
       505      NA     NA       NA      78        NA         
       506      NA     NA       NA      94        NA
       ...
       1956    NA     NA         NA      NA       176
       1957    NA     NA         NA      NA       189
For doing this, I'm using R, in more detail    df %>% mutate(age = coalesce(loc1_age, proce_age)) 
My concern is, with a few of variables it works manually, but I have like around 41 variables to coalesce 
#get common variables for two subset
sub_loc= sub("^loc_", "", loc1)
sub_proce= sub("^proce", "", proce)
ii = intersect(sub_loc, sub_proce) # this give to me 41 variables in common
length(ii)
what I've tried so far to implement a loop in R and create the new variables is this:
for (i in ii) {
df_populated <- df %>% 
 mutate(paste("loc", i, sep = "") = coalesce(paste("loc", i, sep = ""), paste("proce", i, sep = "")
}
But it doesn't work at all. Any ideas to do life easier with data cleaning? Thanks
Is it correct to assume that every ID only has one value, i.e. if there's an entry in one of the location columns, all the other ones will have NA?
Then this could work:
or without data.table:
Thank for the answer, but this is not what I asked. As I know how to merge the variables (I explained it in my post), I do not know how to do with a list of variables in one step.
The last line of the code above should be:
then if produces the output you wrote you wanted from the example data you gave. If this is not what you want, then please clarify.
EDIT: I now understand you want to process multiple variables in the same way, eventually in one step. Repeating the given code for each variable in a loop should work.
But this do it only for one variable. I need to do it for a list of variables
I don't understand. The
meltcommand will work irrespective of the number of additional columns beyond ID.