Question: Split comma-separated list of GO terms into multiple rows and maintain gene identifier in each
Biogeek wrote (13 months ago):

I have a tab-separated dataset in which the GO terms within the second column are comma-separated.

GENEID1   GO:XXXXX,GO:YYYYYY,GO:ZZZZZZ

I want to reshape it into a tab-separated dataset where each GO term is on its own line, alongside the gene identifier:

GENEID1  GO:XXXXX
GENEID1  GO:YYYYYY
GENEID1  GO:ZZZZZZ

Many thanks.


I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

Written 13 months ago by WouterDeCoster

Is it all tab-separated? There appear to be both tabs and commas in your example input.

Written 13 months ago by Joe

A Perl one-liner could be:

perl -ane '{print map {$F[0]."\t".$_."\n" } @F[1..$#F] }' your_input_file |sed -s 's/,$//'

Written 13 months ago by microfuge

This outputs the same as the input if your_input_file is created with echo -e "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ" > your_input_file. Might there be a slight typo?

Written 13 months ago by jean.elbers

jean.elbers wrote (13 months ago):

Do you want to do this in R (possible) or other tools (also possible)?

echo -e "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ" > test.txt

In R:

library("tidyr")
test <- read.table("test.txt", sep = "\t", header = FALSE)
test
#        V1                            V2
# 1 GENEID1 GO:XXXXX,GO:YYYYYY,GO:ZZZZZZ

# Use tidyr::separate_rows() to convert A1\tGO:1,GO:2 to
#   A1  GO:1
#   A1  GO:2
test2 <- tidyr::separate_rows(data = test, V2, sep = ",")

test2
#        V1        V2
# 1 GENEID1  GO:XXXXX
# 2 GENEID1 GO:YYYYYY
# 3 GENEID1 GO:ZZZZZZ

Many thanks for the R version!

Written 13 months ago by Biogeek

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Written 13 months ago by genomax

microfuge wrote (13 months ago):

My apologies for the mistake. Can you please try this:

cat your_input_file |perl -ane '{print map {$F[0]."\t".$_."\n" } split (/,/,$F[1]) }'
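
For readers less familiar with Perl, the difference between the two one-liners can be sketched in Python. This is only an illustration: the sample line mirrors the example in the question, and the variable names are made up here. Perl's -a switch autosplits each line on whitespace, so the comma-separated GO list arrives as a single field; the corrected version adds a second split on commas.

```python
# Sample line mirroring the example in the question (illustrative data only).
line = "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ\n"

# Whitespace split, analogous to Perl's -a autosplit:
# the whole GO list stays together as one field.
fields = line.split()

# Extra split on commas, analogous to split(/,/, $F[1]) in the corrected one-liner.
rows = [fields[0] + "\t" + term for term in fields[1].split(",")]

print("\n".join(rows))
```

Without the comma split there is only one field after the gene ID, which is why the first attempt echoed the input unchanged.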


Perfect! This works. Thanks very much!

Written 13 months ago by Biogeek

st.ph.n (Philadelphia, PA) wrote (13 months ago):
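#!/usr/bin/env python
import sys

# Read the tab-separated file named on the command line.
with open(sys.argv[1], 'r') as f:
    for line in f:
        # Column 1 is the gene ID; column 2 is the comma-separated GO term list.
        gene_id, go_terms = line.strip().split('\t')[:2]
        for term in go_terms.split(','):
            # Print one gene ID / GO term pair per line, tab-separated.
            print(gene_id + '\t' + term)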

Save as go_tab.py, run as python go_tab.py input.txt > output.txt
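
If the real input is messier than the example, the same row expansion can be wrapped in a small function. What follows is a minimal sketch, not part of the original answer: the function name expand_go_terms is hypothetical, and it assumes that column 1 holds the gene ID, column 2 holds the GO list, and that blank lines, genes without a GO column, and empty entries (such as a trailing comma) should be silently skipped.

```python
def expand_go_terms(lines):
    """Yield (gene_id, go_term) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # assumption: skip genes that have no GO column
        gene_id, go_field = fields[0], fields[1]
        for term in go_field.split(","):
            term = term.strip()
            if term:  # ignore empty entries, e.g. after a trailing comma
                yield gene_id, term


# Illustrative input; the gene IDs and GO terms are placeholders.
sample = [
    "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ\n",
    "\n",
    "GENEID2\tGO:AAAAA,\n",
]

for gene_id, term in expand_go_terms(sample):
    print(gene_id + "\t" + term)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample list.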


Many thanks for the python version also!

Written 13 months ago by Biogeek