Question

R is assigning a unique identifier to repeated column names when reading a file in as a data frame. How do I retain all the duplicated columns?

15 months ago

pramach1 ▴ 40

I have some samples in a data frame with different metrics. They were repeatedly sequenced in the lab to get better QC metrics.

When I read in the data using df <- read.csv("samples.csv"), R appends a unique identifier to repeated column names, like this:

SMP000113706    SMP000113706.1  SMP000113707    SMP000113707.1

I want to keep only the duplicated columns and drop all of the unique ones. But almost 190 columns of a 500-column data frame are involved, with different sample numbers and unique identifiers, and there is no pattern to them. How do I retain only the duplicated columns?

SMP000114738  SMP000114739  SMP000114740  SMP000114741  SMP000114982  SMP000114982.1
1.217835036   1.2085439     2.81750655    1.5034578     0.000214017   0.000224536
1.217835036   1.2085439     2.81750655    1.5034578     0.000214017   0.000224536
0.007330334   0.1168343     0.02292839    0.3406125     0.348659681   0.425420762

In this example, I want to retain only SMP000114982 and SMP000114982.1.

Thank you for the help.

retaining R duplicates • 1.5k views

ADD COMMENT • link updated 15 months ago by GenoMax 141k • written 15 months ago by pramach1 ▴ 40

I'd probably just do

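df <- read.csv(..., header = FALSE)            # read the header line as an ordinary data row
colnames(df) <- as.character(unlist(df[1, ]))  # use that first row as the (unmodified) column names
df <- df[-1, ]                                 # drop the row that held the names
df <- type.convert(df, as.is = TRUE)           # columns were read as text; restore numeric types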

ADD REPLY • link 15 months ago by bkleiboeker ▴ 370

score 3 · Accepted Answer · 2023-01-06

15 months ago

Gordon Smyth ★ 7.0k

Using base R:

x <- read.csv("samples.csv", check.names=FALSE)
d <- duplicated(names(x))
y <- x[, names(x) %in% names(x)[d]]

The option check.names=FALSE tells R not to add the unique identifiers.
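As a minimal, self-contained sketch of the same idea, the example below reads a small inline CSV instead of samples.csv. The csv_text string and its sample names (S1 to S4) are made up for illustration. It assumes the repeated columns have exactly identical names in the file.

```r
# Hypothetical stand-in for samples.csv: S3 and S4 each appear twice
csv_text <- "S1,S2,S3,S3,S4,S4
1.2,1.3,2.8,2.9,0.1,0.2
0.5,0.6,0.7,0.8,0.9,1.0"

# check.names = FALSE keeps the header names exactly as written in the file
x <- read.csv(text = csv_text, check.names = FALSE)

nm <- names(x)
# TRUE for every column whose name occurs more than once (all copies, not just the later ones)
dup <- nm %in% nm[duplicated(nm)]

# drop = FALSE keeps the result a data frame even if only one column is selected
y <- x[, dup, drop = FALSE]
```

Here duplicated() alone would flag only the second and later copies of each name, which is why the names are matched back with %in% to pick up the first copy as well.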

ADD COMMENT • link 15 months ago by Gordon Smyth ★ 7.0k

score 2 · Accepted Answer · 2023-01-06

15 months ago

acvill ▴ 340

Here's an approach that uses the dplyr functions select() and contains(). Note that my code also uses R's native pipe |>, which requires R 4.1 or later.

df <- data.frame(A = c(1,2,3),
                 B = c(4,5,6),
                 B.1 = c(7,8,9),
                 C = c(0,1,2), 
                 C.1 = c(3,4,5))

# strip the trailing ".<digits>" suffix that R appends, then count each base name
counts <- names(df) |> gsub(pattern = "\\.[0-9]+$", replacement = "") |> table()

dups <- names(counts[which(counts > 1)])

df |> dplyr::select(dplyr::contains(dups))

I assume that the period character (.) appears only once in duplicated column names, as part of the suffix that R adds.
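One caveat is that contains() matches substrings, so a column whose name merely includes a duplicated name would also be selected. The sketch below is a base R variant (not the dplyr approach above) that compares the stripped names exactly. The extra AB column is added here only to illustrate the substring case, and the code still assumes that a trailing ".<digits>" is always the suffix R added.

```r
# Example data: B and C are duplicated; AB is unique but contains "B" as a substring
df <- data.frame(A   = c(1, 2, 3),
                 B   = c(4, 5, 6),
                 B.1 = c(7, 8, 9),
                 C   = c(0, 1, 2),
                 C.1 = c(3, 4, 5),
                 AB  = c(6, 7, 8))

# Remove a trailing ".<digits>" suffix to recover each column's base name
base_names <- sub("\\.[0-9]+$", "", names(df))

# Keep columns whose base name occurs more than once, using exact matching
keep <- base_names %in% base_names[duplicated(base_names)]
dup_cols <- df[, keep, drop = FALSE]
```

With this data, dup_cols holds B, B.1, C and C.1, while AB is left out because its base name occurs only once.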