Question: Split delimited text into a new row
0
gravatar for Ambika
4 days ago by
Ambika30
United States
Ambika30 wrote:

Hello everyone,

I have a tab delimitted file like this with three columns,

FUN_004018-T1   359 GO:0016491|GO:0046872|GO:0055114
FUN_003797-T1   570 GO:0000287|GO:0030976
FUN_003797-T1   570 GO:0030976
FUN_003797-T1   570 GO:0016831

I want to arrange this file for further analysis like below:

FUN_004018-T1   359 GO:0016491
FUN_004018-T1   359  GO:0046872
FUN_004018-T1   359  GO:0055114
FUN_003797-T1   570 GO:0000287
FUN_003797-T1     570  GO:0030976
FUN_003797-T1   570 GO:0030976
FUN_003797-T1   570 GO:0016831

Could you please help on how I can achieve this. Thank you

awk sed grep • 90 views
ADD COMMENTlink modified 4 days ago by zx87549.6k • written 4 days ago by Ambika30
1

Use awk. Split $3 by | and print $1 $2 $3_split[i] for each component of $3.

Please use the above outline to write code yourself.

Or use R and tidyr::separate_rows - that would be a lot easier. See: https://tidyr.tidyverse.org/reference/separate_rows.html

ADD REPLYlink modified 4 days ago • written 4 days ago by RamRS30k
1
ADD REPLYlink modified 4 days ago • written 4 days ago by zx87549.6k
3
gravatar for Pierre Lindenbaum
4 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:
 awk '{N=split($3,a,/\|/);for(i=1;i<=N;i++) print $1,$2 ,a[i]}' input.txt
ADD COMMENTlink written 4 days ago by Pierre Lindenbaum130k
4
gravatar for RamRS
4 days ago by
RamRS30k
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

Here's a tidyverse solution, just because:

tmp_df <- read.table(text="FUN_004018-T1 359 GO:0016491|GO:0046872|GO:0055114
FUN_003797-T1 570 GO:0000287|GO:0030976
FUN_003797-T1 570 GO:0030976
FUN_003797-T1 570 GO:0016831", sep=" ", header=FALSE, stringsAsFactors = FALSE, col.names = c('col1', 'col2','col3'))

tmp_df
           col1 col2                             col3
1 FUN_004018-T1  359 GO:0016491|GO:0046872|GO:0055114
2 FUN_003797-T1  570            GO:0000287|GO:0030976
3 FUN_003797-T1  570                       GO:0030976
4 FUN_003797-T1  570                       GO:0016831

tidyr::separate_rows(tmp_df, col3, sep = "[|]")
# A tibble: 7 x 3
  col1           col2 col3      
  <chr>         <int> <chr>     
1 FUN_004018-T1   359 GO:0016491
2 FUN_004018-T1   359 GO:0046872
3 FUN_004018-T1   359 GO:0055114
4 FUN_003797-T1   570 GO:0000287
5 FUN_003797-T1   570 GO:0030976
6 FUN_003797-T1   570 GO:0030976
7 FUN_003797-T1   570 GO:0016831
ADD COMMENTlink modified 4 days ago • written 4 days ago by RamRS30k

It's always a GREAT sight when I see somebody post a Tidyverse solution. :D

ADD REPLYlink written 4 days ago by bioinformatics2020350
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1958 users visited in the last hour