data prerprocessing coding
2
0
Entering edit mode
3.3 years ago
sskimvd • 0

I have 200 columns and 80,000 rows of data.

Rows contain duplicate data.

I want to extract only the unique data from the duplicated data in the row, but leave only the maximum value in the column among the duplicated rows.

How can I do it?

https://i.ibb.co/72xT8Gv/13.png

thank you

R SNP sequencing • 616 views
ADD COMMENT
2
Entering edit mode
3.3 years ago

Some example data. Note that I turned the rownames from your example into a column called 'rowid'.

before <- structure(list(rowid = c("r1", "r1", "r1", "r1", "r1", "r2",
"r2", "r2", "r2", "r2"), s1 = c(0L, 3L, 5L, 20L, 15L, 12L, 18L,
15L, 20L, 17L), s2 = c(16L, 13L, 11L, 1L, 18L, 20L, 0L, 3L, 7L,
4L), s3 = c(15L, 7L, 16L, 20L, 1L, 7L, 14L, 6L, 5L, 10L), s4 = c(14L,
10L, 5L, 7L, 18L, 14L, 20L, 1L, 10L, 12L), s5 = c(4L, 9L, 5L,
9L, 7L, 9L, 2L, 5L, 0L, 19L)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))

> before
   rowid s1 s2 s3 s4 s5
1     r1  0 16 15 14  4
2     r1  3 13  7 10  9
3     r1  5 11 16  5  5
4     r1 20  1 20  7  9
5     r1 15 18  1 18  7
6     r2 12 20  7 14  9
7     r2 18  0 14 20  2
8     r2 15  3  6  1  5
9     r2 20  7  5 10  0
10    r2 17  4 10 12 19

tidyverse solution

library("dplyr")

after <- before %>%
  group_by(rowid) %>%
  summarize(across(everything(), max))

> after
# A tibble: 2 x 6
  rowid    s1    s2    s3    s4    s5
  <chr> <int> <int> <int> <int> <int>
1 r1       20    18    20    18     9
2 r2       20    20    14    20    19

data.table solution. This will be quicker and more memory efficient if you have a lot of data and/or limited resources.

library("data.table")

setDT(before)
after <- before[, lapply(.SD, max), by=rowid]

> after
   rowid s1 s2 s3 s4 s5
1:    r1 20 18 20 18  9
2:    r2 20 20 14 20 19
ADD COMMENT
0
Entering edit mode
3.3 years ago

with datamash:

input:

$ cat file.txt 

rowid   s1  s2  s3  s4  s5
r1  0   16  15  14  4
r1  3   13  7   10  9
r1  5   11  16  5   5
r1  20  1   20  7   9
r1  15  18  1   18  7
r2  12  20  7   14  9
r2  18  0   14  20  2
r2  15  3   6   1   5
r2  20  7   5   10  0
r2  17  4   10  12  19

output:

$ datamash -sH -g1 max 2-6 < file.txt 

GroupBy(rowid)  max(s1) max(s2) max(s3) max(s4) max(s5)
r1  20  18  20  18  9
r2  20  20  14  20  19
ADD COMMENT

Login before adding your answer.

Traffic: 2695 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6