Here are the disclaimers first:
- I am not sure this is a pure bioinformatics question, but then I've been working on this for 2 weeks now and I'm a bioinformatician so I guess it qualifies
- It's going to be a (kinda) long question, a typical Ram style rant, so strap in.
I'm trying to integrate data on biological samples from ~35 tabular data files. These could be TXT, CSV or XLS* files. I extracted the content from each file to its own data.frame and then set out merging these data frames to get one huge data frame.
All data frames have an ID column, and that's the column I'd use if I were
SQL-JOINing them. However, they can also share other columns. So, to give an extremely simplified example, consider these two files:
id col1 col2 col3 1 A1 B1 C1 2 A2 B2 <NA> 3 <NA> B3 C3 3 <NA> <NA> C4
id col1 col4 col3 1 A1 D1 C1 2 A2 D2 C2 3 <NA> D3 <NA>
My desired output is:
id col1 col2 col3 col4 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 <NA> B3 C3 D3 3 <NA> <NA> C4 D3
Essentially, for each column, among matching rows, pick the non-
NA value if only one such value exists. If multiple non-
NA values exist, create a new row for each.
I'm using a merge function with
by=intersect(colnames(df1),colnames(df2)),all=T. Is there any way I can use a custom
match function in the
merge function? Am I misunderstanding something about
merge and being overly cautious with the
by=intersect? Should I abandon R and try and tackle this with a programming language?
What do you recommend?