Question: data prerprocessing coding
0
gravatar for sskimvd
26 days ago by
sskimvd0
sskimvd0 wrote:

I have 200 columns and 80,000 rows of data.

Rows contain duplicate data.

I want to extract only the unique data from the duplicated data in the row, but leave only the maximum value in the column among the duplicated rows.

How can I do it?

https://i.ibb.co/72xT8Gv/13.png

thank you

sequencing snp R • 199 views
ADD COMMENTlink modified 13 days ago by rpolicastro3.2k • written 26 days ago by sskimvd0
2
gravatar for rpolicastro
13 days ago by
rpolicastro3.2k
Bloomington, IN
rpolicastro3.2k wrote:

Some example data. Note that I turned the rownames from your example into a column called 'rowid'.

before <- structure(list(rowid = c("r1", "r1", "r1", "r1", "r1", "r2",
"r2", "r2", "r2", "r2"), s1 = c(0L, 3L, 5L, 20L, 15L, 12L, 18L,
15L, 20L, 17L), s2 = c(16L, 13L, 11L, 1L, 18L, 20L, 0L, 3L, 7L,
4L), s3 = c(15L, 7L, 16L, 20L, 1L, 7L, 14L, 6L, 5L, 10L), s4 = c(14L,
10L, 5L, 7L, 18L, 14L, 20L, 1L, 10L, 12L), s5 = c(4L, 9L, 5L,
9L, 7L, 9L, 2L, 5L, 0L, 19L)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))

> before
   rowid s1 s2 s3 s4 s5
1     r1  0 16 15 14  4
2     r1  3 13  7 10  9
3     r1  5 11 16  5  5
4     r1 20  1 20  7  9
5     r1 15 18  1 18  7
6     r2 12 20  7 14  9
7     r2 18  0 14 20  2
8     r2 15  3  6  1  5
9     r2 20  7  5 10  0
10    r2 17  4 10 12 19

tidyverse solution

library("dplyr")

after <- before %>%
  group_by(rowid) %>%
  summarize(across(everything(), max))

> after
# A tibble: 2 x 6
  rowid    s1    s2    s3    s4    s5
  <chr> <int> <int> <int> <int> <int>
1 r1       20    18    20    18     9
2 r2       20    20    14    20    19

data.table solution. This will be quicker and more memory efficient if you have a lot of data and/or limited resources.

library("data.table")

setDT(before)
after <- before[, lapply(.SD, max), by=rowid]

> after
   rowid s1 s2 s3 s4 s5
1:    r1 20 18 20 18  9
2:    r2 20 20 14 20 19
ADD COMMENTlink modified 13 days ago • written 13 days ago by rpolicastro3.2k
0
gravatar for cpad0112
12 days ago by
cpad011214k
Hyderabad India
cpad011214k wrote:

with datamash:

input:

$ cat file.txt 

rowid   s1  s2  s3  s4  s5
r1  0   16  15  14  4
r1  3   13  7   10  9
r1  5   11  16  5   5
r1  20  1   20  7   9
r1  15  18  1   18  7
r2  12  20  7   14  9
r2  18  0   14  20  2
r2  15  3   6   1   5
r2  20  7   5   10  0
r2  17  4   10  12  19

output:

$ datamash -sH -g1 max 2-6 < file.txt 

GroupBy(rowid)  max(s1) max(s2) max(s3) max(s4) max(s5)
r1  20  18  20  18  9
r2  20  20  14  20  19
ADD COMMENTlink written 12 days ago by cpad011214k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 953 users visited in the last hour
_