Question

How can I extract duplicated rows for the same value and write them down in one row

2.7 years ago

amal.elzemrany ▴ 30

I have a TSV file of duplicated genes and their transcripts. I need to collapse each duplicated gene and its transcripts into a single row using bash.

input:

STRG.8 STRG.8.1
STRG.8 STRG.8.2
STRG.88 STRG.88.1
STRG.88 STRG.88.2

The output should list each gene, the number of duplicated transcripts, and the transcripts themselves, like this:

STRG.8 2 STRG.8.1, STRG.8.2
STRG.88 2 STRG.88.1, STRG.88.2

bash • 532 views

updated 2.7 years ago by Ram 43k • written 2.7 years ago by amal.elzemrany ▴ 30

Is bash your only option, or are you fine with a bit of R?

#Package needed (dplyr re-exports the %>% pipe, so magrittr does not have to be loaded separately).
#Install with install.packages("dplyr").
library(dplyr)

#Your data.
df <- read.table(text = "col1 col2
STRG.8 STRG.8.1
STRG.8 STRG.8.2
STRG.88 STRG.88.1
STRG.88 STRG.88.2", header = TRUE)

#Grouping the data: one row per gene, with the transcript count and the joined transcripts.
df2 <- df %>% 
  group_by(col1) %>% 
  arrange(col2, .by_group = TRUE) %>% 
  summarise(n = n(), col2 = paste(col2, collapse = ", "))

#Result.
df2
# # A tibble: 2 x 3
#   col1        n col2
#   <chr>   <int> <chr>
# 1 STRG.8      2 STRG.8.1, STRG.8.2
# 2 STRG.88     2 STRG.88.1, STRG.88.2

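#Writing the result to a tab-separated file named myfile.tsv.
#(The joined transcripts contain commas, so a comma separator with quote = FALSE would split them into extra columns.)
write.table(df2, file = "myfile.tsv", sep = "\t", quote = FALSE, row.names = FALSE)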

2.7 years ago by Dunois ★ 2.5k
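
If you want to stay in bash, the same grouping can be done with awk. This is only a minimal sketch: the file names genes.tsv and collapsed.txt are placeholders, and it assumes the gene ID is in column 1 and the transcript ID in column 2. awk's default field splitting accepts tabs or spaces, and the NF >= 2 guard skips blank lines.

```shell
# Hypothetical sample input (placeholder name); replace with your own TSV file.
printf 'STRG.8\tSTRG.8.1\nSTRG.8\tSTRG.8.2\nSTRG.88\tSTRG.88.1\nSTRG.88\tSTRG.88.2\n' > genes.tsv

# For each gene (column 1), count its rows and join its transcripts (column 2).
awk '
  NF >= 2 {
    if (!($1 in count)) order[++n] = $1                      # remember first-seen gene order
    count[$1]++                                              # number of transcripts per gene
    list[$1] = ($1 in list) ? (list[$1] ", " $2) : $2        # comma-joined transcripts
  }
  END {
    for (i = 1; i <= n; i++) {
      g = order[i]
      print g, count[g], list[g]                             # gene, count, transcripts
    }
  }
' genes.tsv > collapsed.txt

cat collapsed.txt
```

Genes are printed in the order they first appear, so the input does not need to be sorted. To keep only genes that really have more than one transcript, one option is to wrap the print in if (count[g] > 1).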