Left Join Issue in R
9 weeks ago
cthangav ▴ 40

I want to merge two data frames (named 'data' and 'add') so that only the rows of the first data frame (data) are kept, i.e. a left join.

When I use

Merged <- merge(data, add, by.x = "DATAGene", by.y = "ADDGene", all.x = TRUE, all.y = FALSE)

I get more rows than I had in the first data frame.

dim(data)

[1] 21578 4

dim(add)

[1] 25778 2

dim(Merged)

[1] 21639 5

Why would this happen and is there a way to avoid it?

Both the DATAGene and ADDGene columns are character columns.

The data frame "add" looks like:

    ADDGene   V1
1   TSPAN6    51
2   TNMD       0
3   DPM1     114
4   SCYL3      9
5   C1orf112   1
...
87  SPPL2B     6
88  FAM214B   20
89  COPZ2     75
...

The data frame "data" looks like:

    DATAGene    V2   V3   V4
1   TSPAN6     294  778  595
2   TNMD         0    8    0
3   DPM1       354  311  696
4   SCYL3       86   94  134
5   C1orf112   147  268  263
...
87  FAM214B    415  115  156
88  COPZ2       82   13   12
89  PRKAR2B   1523  710  250

R • 356 views

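Most likely the gene-name column in at least one of the data frames contains duplicate values. In a left join, each row of data is paired with every row of add that has the same gene name, so a gene that appears more than once in add produces extra rows in the result. You can check both key columns for duplicates (TRUE means all values are unique):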

length(unique(data$DATAGene))==nrow(data)
length(unique(add$ADDGene))==nrow(add)
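
To see the effect on a small scale, here is a minimal sketch using made-up toy data (the values and the duplicated TNMD row are assumptions for illustration, not the poster's real tables). It shows how one duplicated key in add adds a row to the left join, and how to list the duplicated gene names:

```r
# Toy data (hypothetical values, not the real tables from the question)
data <- data.frame(DATAGene = c("TSPAN6", "TNMD", "DPM1"),
                   V2 = c(294, 0, 354),
                   stringsAsFactors = FALSE)
add <- data.frame(ADDGene = c("TSPAN6", "TNMD", "TNMD", "SCYL3"),
                  V1 = c(51, 0, 7, 9),
                  stringsAsFactors = FALSE)

# Left join: each row of data is paired with every matching row of add
Merged <- merge(data, add, by.x = "DATAGene", by.y = "ADDGene",
                all.x = TRUE, all.y = FALSE)

nrow(data)    # 3
nrow(Merged)  # 4, because TNMD occurs twice in add

# List the gene names that occur more than once in add
dup_genes <- unique(add$ADDGene[duplicated(add$ADDGene)])
dup_genes     # "TNMD"
```

Running the same duplicated() check on the real add$ADDGene and data$DATAGene columns should show which genes are responsible for the extra rows.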

You are correct: the data and Merged tables have the same number of rows if I only count the unique gene names. Do you know if there is a way to keep only the rows from data, including its duplicates?


I believe removing the duplicates from add with the following command fixed the issue. Thank you.

add2 <- add %>% dplyr::distinct(ADDGene, .keep_all = TRUE)
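
One thing to keep in mind is that distinct() with .keep_all = TRUE keeps the first row it encounters for each gene and silently drops the rest. The sketch below, again on made-up toy data, shows a base R equivalent and, as an alternative, combining the duplicates by summing V1. Whether summing is appropriate depends on what V1 represents, so treat that option as an assumption:

```r
# Toy data (hypothetical values, not the real tables from the question)
data <- data.frame(DATAGene = c("TSPAN6", "TNMD", "DPM1"),
                   V2 = c(294, 0, 354),
                   stringsAsFactors = FALSE)
add <- data.frame(ADDGene = c("TSPAN6", "TNMD", "TNMD", "SCYL3"),
                  V1 = c(51, 0, 7, 9),
                  stringsAsFactors = FALSE)

# Option A: keep the first row per gene (base R counterpart of
# dplyr::distinct(ADDGene, .keep_all = TRUE))
add_first <- add[!duplicated(add$ADDGene), ]

# Option B (assumption: duplicate rows should be combined):
# collapse duplicates by summing V1 per gene
add_sum <- aggregate(V1 ~ ADDGene, data = add, FUN = sum)

# With unique keys in the right-hand table, the left join
# returns exactly one row per row of data
Merged <- merge(data, add_first, by.x = "DATAGene", by.y = "ADDGene",
                all.x = TRUE, all.y = FALSE)
stopifnot(nrow(Merged) == nrow(data))
```

Keeping the stopifnot() check right after the merge is a cheap way to catch duplicated keys early if the input tables change later.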