R adds unique suffixes to repeated column names when reading a file into a data frame. How do I retain all the duplicated columns?
15 months ago
pramach1 ▴ 40

I have some samples in a data frame with different metrics. They were repeatedly sequenced in the lab to get better QC metrics.

When I read in the data frame using df <- read.csv("samples.csv"), R appends a unique suffix to each repeated column name, like this:

SMP000113706    SMP000113706.1  SMP000113707    SMP000113707.1  

I want to keep only the duplicated columns and none of the unique columns. But there are almost 190 such columns in a 500-column data frame, with different sample numbers and unique identifiers, and there is no pattern to them. How do I retain only the duplicated columns?

SMP000114738  SMP000114739  SMP000114740  SMP000114741  SMP000114982  SMP000114982.1
1.217835036   1.2085439     2.81750655    1.5034578     0.000214017   0.000224536
1.217835036   1.2085439     2.81750655    1.5034578     0.000214017   0.000224536
0.007330334   0.1168343     0.02292839    0.3406125     0.348659681   0.425420762

In this example I want to retain only SMP000114982 and SMP000114982.1.

Thank you for the help.

retaining R duplicates • 1.5k views

I'd probably just do

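df <- read.csv(..., header = FALSE)    # read the header line as an ordinary data row
colnames(df) <- unlist(df[1, ])        # use that row as column names, duplicates kept
df <- df[-1, ]                         # drop the header row from the data
df <- type.convert(df, as.is = TRUE)   # columns were read as character; restore their types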
15 months ago
Gordon Smyth ★ 7.0k

Using base R:

x <- read.csv("samples.csv", check.names=FALSE)
d <- duplicated(names(x))
y <- x[, names(x) %in% names(x)[d]]

The option check.names=FALSE tells R not to add the unique identifiers.
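
To make the effect of check.names concrete, here is a minimal, self-contained sketch. The inline CSV text and the short column names S1 to S4 are made up for illustration and stand in for samples.csv; the selection logic is the same as above, just split into named steps.

```r
# Hypothetical inline CSV standing in for samples.csv; all values are made up.
csv_text <- "S1,S2,S3,S3,S4,S4
1.2,0.5,2.8,2.9,0.1,0.2
0.7,0.1,0.3,0.4,0.9,1.0"

# Default behaviour (check.names = TRUE): repeated names are made unique.
default_names <- names(read.csv(text = csv_text))

# With check.names = FALSE the header is kept exactly as written.
x <- read.csv(text = csv_text, check.names = FALSE)

# duplicated() flags only the second and later occurrences of a name,
# so collect the repeated names first and then mark every copy of them.
dup_names <- unique(names(x)[duplicated(names(x))])
keep <- names(x) %in% dup_names

# Keep only the columns whose name occurs more than once.
y <- x[, keep]
```

Here default_names comes back with the suffixed names (S3, S3.1, ...), while names(x) keeps the repeats as written. Note that subsetting a data frame with `[` may itself re-apply unique suffixes to repeated names in the result, so check names(y) if the exact labels matter.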

15 months ago
acvill ▴ 340

Here's an approach that uses the dplyr functions select() and contains(). Note that my code also uses R's native pipe |> (available in R 4.1 and later).

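# Toy data: B and C each appear twice, with the suffix R adds on import
df <- data.frame(A = c(1,2,3),
                 B = c(4,5,6),
                 B.1 = c(7,8,9),
                 C = c(0,1,2),
                 C.1 = c(3,4,5))

# Strip the trailing ".<digits>" suffix and count how often each base name occurs
counts <- names(df) |> gsub(pattern = "\\.[0-9]+$", replacement = "") |> table()

# Base names that occur more than once
dups <- names(counts[which(counts > 1)])

# Keep every column whose name contains one of those base names
df |> dplyr::select(dplyr::contains(dups))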

I assume that the period character (.) only appears once in duplicated column names.
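
Because contains() matches substrings, a base name such as B would also pull in an unrelated column whose name merely contains it. If that is a concern, a base R variant that compares whole base names might look like the sketch below. This is a swapped-in alternative, not the dplyr approach above: the extra column AB is hypothetical and added only to show the difference, and the sketch assumes that no original sample ID itself ends in a period followed by digits.

```r
# Toy data as it would look after a default read.csv(); AB is a hypothetical
# extra column whose name contains "B" but is not a duplicate of B.
df <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), B.1 = c(7, 8, 9),
                 AB = c(1, 1, 1), C = c(0, 1, 2), C.1 = c(3, 4, 5))

# Strip a trailing ".<digits>" suffix to recover each column's base name.
# Assumption: no original sample ID itself ends in ".<digits>".
base <- sub("\\.[0-9]+$", "", names(df))

# Base names that occur more than once.
dups <- unique(base[duplicated(base)])

# Keep every column whose whole base name is duplicated (exact match).
kept <- df[, base %in% dups, drop = FALSE]
```

With this toy input, kept holds B, B.1, C and C.1, and AB is left out.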


This worked. Thank you so much.
