R - Merging and aligning two CSVs using common values in multiple columns
1
1
Entering edit mode
2.2 years ago
Michael ▴ 10

(Repost so tables aren't images.) I currently have two .csv files that look like this:

File 1:

ATTEMPT  RESULTS
Int1     B
Int2     H

File 2:

NAME  OUTCOME1  OUTCOME2  OUTCOME3
Sam1  A         B         C
Sam2  D         E         F
Sam3  G         H         I

I would like to merge and align the two .csvs such that the result each row of File 1 is aligned by its "result" cell, against any of the three "outcome" columns in File 2, leaving blanks or "NA"s if there are no similarities.

Ideally, would look like this:

ATTEMPT  RESULTS  NAME  OUTCOME1  OUTCOME2  OUTCOME3
Int1     B        Sam1  A         B         C
                  Sam2  D         E         F
Int2     H        Sam3  G         H         I

Any help would be much appreciated.

Edit: for my real data, I have several thousand rows in each file.

csv R • 1.0k views
ADD COMMENT
0
Entering edit mode
> library(sqldf)
> df1=read.csv("test1.txt", header = T, sep="\t")
> df2=read.csv("test2.txt", header = T, sep="\t")
> sqldf('select df1.*, df2.* from df2 left join df1 on df2.OUTCOME1=df1.RESULTS OR
+       df2.OUTCOME2=df1.RESULTS OR
+       df2.OUTCOME3=df1.RESULTS')

  ATTEMPT RESULTS NAME OUTCOME1 OUTCOME2 OUTCOME3
1    Int1       B Sam1        A        B        C
2    <NA>    <NA> Sam2        D        E        F
3    Int2       H Sam3        G        H        I
ADD REPLY
0
Entering edit mode
2.2 years ago
ning ▴ 120

You should left join file 2 with file 1, using the OUTCOME2/RESULT column. Here's an untested example using R/tidyverse:

# In order to left-join, there must be a shared column name
file1 <- rename(file1, OUTCOME2=RESULTS)
# Left-join returns all the rows from the first argument,
# adding columns from the second argument where the key column (OUTCOME2) value matches,
# or leaving NA otherwise
combined <- left_join(file1, file2, by="OUTCOME2")

# OUTCOME2 and RESULT are redundant columns, 
# but if it is important to you that both columns exist, 
# you can add it back using mutate.
combined <- mutate(combined, RESULTS=OUTCOME2)
# Finally, if ordering of the columns matter to you:
combined <- select(combined, ATTEMPT, RESULTS, NAME, OUTCOME1, OUTCOME2, OUTCOME3)
ADD COMMENT
0
Entering edit mode

This doesn't address OP requirement of "any of the three "outcome" columns in File 2"

ADD REPLY
0
Entering edit mode

My mistake, you're right! Your solution is much better :)

ADD REPLY
0
Entering edit mode

well, there is no better or best solution, it's only working solution. Your solution works given OP data. Solution I proposed works for 3 columns only.

ADD REPLY

Login before adding your answer.

Traffic: 2579 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6