Question

Split delimited text into a new row

0

3.6 years ago

AP ▴ 80

Hello everyone,

I have a tab-delimited file with three columns, like this:

FUN_004018-T1   359 GO:0016491|GO:0046872|GO:0055114
FUN_003797-T1   570 GO:0000287|GO:0030976
FUN_003797-T1   570 GO:0030976
FUN_003797-T1   570 GO:0016831

I want to rearrange this file for further analysis so that each GO term in the third column gets its own row, like below:

FUN_004018-T1   359 GO:0016491
FUN_004018-T1   359 GO:0046872
FUN_004018-T1   359 GO:0055114
FUN_003797-T1   570 GO:0000287
FUN_003797-T1   570 GO:0030976
FUN_003797-T1   570 GO:0030976
FUN_003797-T1   570 GO:0016831

Could you please help me work out how to achieve this? Thank you.

sed awk grep • 807 views

ADD COMMENT • link updated 3.6 years ago by zx8754 11k • written 3.6 years ago by AP ▴ 80

1

Use awk. Split $3 on | into an array, then print $1, $2 and one array element per line, for each component of $3.

Please use the above outline to write code yourself.

Or use R and tidyr::separate_rows - that would be a lot easier. See: https://tidyr.tidyverse.org/reference/separate_rows.html

ADD REPLY • link 3.6 years ago by Ram 43k

1

ADD REPLY • link 3.6 years ago by zx8754 11k

4

3.6 years ago

Ram 43k

Here's a tidyverse solution, just because:

tmp_df <- read.table(text="FUN_004018-T1 359 GO:0016491|GO:0046872|GO:0055114
FUN_003797-T1 570 GO:0000287|GO:0030976
FUN_003797-T1 570 GO:0030976
FUN_003797-T1 570 GO:0016831", sep=" ", header=FALSE, stringsAsFactors = FALSE, col.names = c('col1', 'col2','col3'))

tmp_df
           col1 col2                             col3
1 FUN_004018-T1  359 GO:0016491|GO:0046872|GO:0055114
2 FUN_003797-T1  570            GO:0000287|GO:0030976
3 FUN_003797-T1  570                       GO:0030976
4 FUN_003797-T1  570                       GO:0016831

tidyr::separate_rows(tmp_df, col3, sep = "[|]")
# A tibble: 7 x 3
  col1           col2 col3      
  <chr>         <int> <chr>     
1 FUN_004018-T1   359 GO:0016491
2 FUN_004018-T1   359 GO:0046872
3 FUN_004018-T1   359 GO:0055114
4 FUN_003797-T1   570 GO:0000287
5 FUN_003797-T1   570 GO:0030976
6 FUN_003797-T1   570 GO:0030976
7 FUN_003797-T1   570 GO:0016831

ADD COMMENT • link 3.6 years ago by Ram 43k

0

It's always a GREAT sight when I see somebody post a Tidyverse solution. :D

ADD REPLY • link 3.6 years ago by bioinformatics2020 ▴ 820

score 3 · Accepted Answer · 2020-09-16

3

3.6 years ago

Pierre Lindenbaum 161k

awk '{ N = split($3, a, /\|/); for (i = 1; i <= N; i++) print $1, $2, a[i] }' input.txt
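
A possible variant of the one-liner above, offered as a minimal sketch rather than as part of the original answer. The question describes the input as tab-delimited, and awk's default output separator is a single space, so the command as written prints space-separated rows. Assuming the input really contains tabs and that tab-separated output is wanted, setting both FS and OFS to a tab keeps the format. The file names go_input.tsv and go_long.tsv are hypothetical, and the sample rows are copied from the question.

```shell
# Build a small tab-delimited sample (hypothetical file name, rows from the question).
printf 'FUN_004018-T1\t359\tGO:0016491|GO:0046872|GO:0055114\n'  > go_input.tsv
printf 'FUN_003797-T1\t570\tGO:0000287|GO:0030976\n'            >> go_input.tsv
printf 'FUN_003797-T1\t570\tGO:0030976\n'                       >> go_input.tsv
printf 'FUN_003797-T1\t570\tGO:0016831\n'                       >> go_input.tsv

# Read and write tab-separated fields. Column 3 is split on a literal "|",
# and one output row is printed per GO term.
awk 'BEGIN { FS = OFS = "\t" }
     {
         n = split($3, terms, /\|/)      # n = number of GO terms on this row
         for (i = 1; i <= n; i++)
             print $1, $2, terms[i]      # fields are joined with OFS (a tab)
     }' go_input.tsv > go_long.tsv

cat go_long.tsv
```

Note that a row whose third column is empty produces no output here, because split() returns 0 for an empty string and the loop body never runs.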

ADD COMMENT • link 3.6 years ago by Pierre Lindenbaum 161k