Question

Split comma separated list of GO terms into multiple rows and maintain gene identifier in each

0

6.2 years ago

Biogeek ▴ 480

I have a tab separated dataset, although the GO terms are comma separated.

GENEID1   GO:XXXXX,GO:YYYYYY,GO:ZZZZZZ

I want to convert this into a fully tab-separated dataset where each GO term is on its own line, together with the gene identifier:

GENEID1  GO:XXXXX
GENEID1  GO:YYYYYY
GENEID1  GO:ZZZZZZ

Many thanks.

Gene ontology data manipulation • 2.8k views

ADD COMMENT • link updated 6.2 years ago by st.ph.n ★ 2.7k • written 6.2 years ago by Biogeek ▴ 480

0

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

ADD REPLY • link 6.2 years ago by WouterDeCoster 48k

0

Is it all tab separated? There are what look like both tabs and commas in your example input.

ADD REPLY • link 6.2 years ago by Joe 22k

0

A Perl one-liner could be:

perl -ane '{print map {$F[0]."\t".$_."\n" } @F[1..$#F] }' your_input_file |sed -s 's/,$//'

ADD REPLY • link 6.2 years ago by microfuge ★ 2.0k

1

This outputs the same as the input if your_input_file is created with echo -e "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ" > your_input_file. Might there be a slight typo?
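The behaviour reported here can be reproduced outside Perl. As a rough illustration in Python, assuming that `perl -a` splits each line on whitespace (its documented default), the comma separated GO list stays a single field, so pairing the first field with each remaining field simply rebuilds the input line:

```python
# Illustration only: mimic Perl's -a autosplit (split on whitespace) in Python.
line = "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ\n"

fields = line.split()  # whitespace split, comparable to @F under perl -a

# Pair the first field with every remaining field, as the one-liner does.
rebuilt = [fields[0] + "\t" + rest + "\n" for rest in fields[1:]]

print(fields)                    # two fields; the GO list is still one string
print("".join(rebuilt) == line)  # True: the output equals the input
```

The commas are never split, which suggests the fix is to split the second field on `,` explicitly.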

ADD REPLY • link 6.2 years ago by jean.elbers ★ 1.7k

score 3 · Answer 1 · 2019-04-15

3

6.2 years ago

jean.elbers ★ 1.7k

Do you want to do this in R (possible) or other tools (also possible)?

echo -e "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ" > test.txt

in R

library("tidyr")
test <- read.table("test.txt", sep = "\t", header = FALSE)
test
       V1                           V2
1 GENEID1 GO:XXXXX,GO:YYYYYY,GO:ZZZZZZ

# use tidyr separate rows to  convert A1\tGO:1,GO:2 to
#                                     A1  GO:1
#                                     A1  GO:2
test2 <- tidyr::separate_rows(data = test,V2,sep = ",")

test2
       V1        V2
1 GENEID1  GO:XXXXX
2 GENEID1 GO:YYYYYY
3 GENEID1 GO:ZZZZZZ

ADD COMMENT • link 6.2 years ago by jean.elbers ★ 1.7k

0

Many thanks for the R version!

ADD REPLY • link 6.2 years ago by Biogeek ▴ 480

0

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

ADD REPLY • link 6.2 years ago by GenoMax 152k

score 2 · Answer 2 · 2019-04-15

2

6.2 years ago

microfuge ★ 2.0k

My apologies for the mistake. Can you please try this:

cat your_input_file |perl -ane '{print map {$F[0]."\t".$_."\n" } split (/,/,$F[1]) }'

ADD COMMENT • link 6.2 years ago by microfuge ★ 2.0k

0

Perfect! This works. Thanks very much!

ADD REPLY • link 6.2 years ago by Biogeek ▴ 480

score 1 · Answer 3 · 2019-04-15

1

6.2 years ago

st.ph.n ★ 2.7k

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:
    for line in f:
        # Column 1 is the gene ID, column 2 the comma-separated GO terms
        gene_id, go_terms = line.strip().split('\t')[:2]
        # Print one tab-separated "gene ID, GO term" row per GO term
        for term in go_terms.split(','):
            print(gene_id + '\t' + term)

Save as go_tab.py, run as python go_tab.py input.txt > output.txt
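
If the input may contain blank lines, genes with no GO column, or a stray trailing comma, a slightly more defensive variant could look like the sketch below. The function name `split_go_terms` and the sample data are hypothetical, and the sketch assumes the gene ID is in column 1 and the GO terms in column 2:

```python
def split_go_terms(lines):
    """Yield (gene_id, go_term) pairs from tab-separated lines.

    Hypothetical helper: assumes column 1 is the gene ID and column 2
    is a comma-separated list of GO terms.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines and genes without a GO column
        gene_id, go_column = fields[0], fields[1]
        for term in go_column.split(","):
            term = term.strip()
            if term:  # ignore empty entries such as a trailing comma
                yield gene_id, term


# Made-up sample input: a normal row, a blank line, and a trailing comma.
sample = [
    "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ\n",
    "\n",
    "GENEID2\tGO:AAAAA,\n",
]
for gene_id, term in split_go_terms(sample):
    print(gene_id + "\t" + term)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample list.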

ADD COMMENT • link 6.2 years ago by st.ph.n ★ 2.7k

0

Many thanks for the python version also!

ADD REPLY • link 6.2 years ago by Biogeek ▴ 480