How can I collect duplicated rows for the same value and write them in one row?
2.7 years ago

I have a TSV file of duplicated genes and their transcripts. I need to put each duplicated gene and all of its transcripts on one row, using bash.

input:

STRG.8 STRG.8.1
STRG.8 STRG.8.2
STRG.88 STRG.88.1
STRG.88 STRG.88.2

I need each output row to contain the gene, the number of duplicated transcripts, and the transcripts themselves, like this:

STRG.8 2 STRG.8.1, STRG.8.2

STRG.88 2 STRG.88.1, STRG.88.2

Is bash your only option, or are you fine with a bit of R?

#Packages needed.
#Install with install.packages().
library(dplyr)
library(magrittr)

#Your data.
df <- read.table(text = "col1 col2
STRG.8 STRG.8.1
STRG.8 STRG.8.2
STRG.88 STRG.88.1
STRG.88 STRG.88.2", header = TRUE)

#Grouping the data.
df2 <- df %>% 
  group_by(col1) %>% 
  arrange(col2, .by_group = TRUE) %>% 
  mutate(col2 = paste0(col2, collapse = ",")) %>%
  distinct(col2, .keep_all = TRUE) %>%
  ungroup()

#Result.
df2
# # A tibble: 2 x 2
#  col1    col2               
#   <chr>   <chr>              
# 1 STRG.8  STRG.8.1,STRG.8.2  
# 2 STRG.88 STRG.88.1,STRG.88.2

#Writing the result to a tab-separated file named myfile.tsv.
#A tab is used as the separator because col2 already contains commas.
write.table(df2, file = "myfile.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
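
Since the question asks for bash, here is a minimal awk sketch of the same grouping that also prints the transcript count. The function name `collapse_transcripts` is hypothetical, and the sketch assumes whitespace-separated input with the gene in column 1 and the transcript in column 2; genes that appear only once are skipped.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "gene transcript" pairs from the given file(s),
# or from stdin when no file is given.
collapse_transcripts() {
  awk '
    NF >= 2 {                        # skip blank or incomplete lines
      if (!($1 in count)) {
        order[++n] = $1              # remember first-seen order of genes
        list[$1] = $2
      } else {
        list[$1] = list[$1] ", " $2  # append transcript to the gene list
      }
      count[$1]++
    }
    END {
      for (i = 1; i <= n; i++) {
        g = order[i]
        if (count[g] > 1)            # keep only duplicated genes
          print g, count[g], list[g]
      }
    }
  ' "$@"
}

# Demo on the sample input from the question, fed on stdin.
printf 'STRG.8 STRG.8.1\nSTRG.8 STRG.8.2\nSTRG.88 STRG.88.1\nSTRG.88 STRG.88.2\n' |
  collapse_transcripts
# STRG.8 2 STRG.8.1, STRG.8.2
# STRG.88 2 STRG.88.1, STRG.88.2
```

On a real file this would be called as `collapse_transcripts input.tsv > output.txt`, where both file names are placeholders. The output fields are separated by single spaces to match the format requested in the question.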