Question

How to split columns into rows based on gene ids in R?

0

Entering edit mode

7.9 years ago

BioBing ▴ 150

Hi all,

Does any of you cool R-sharks know how to transform data from this:

Gene    GO_terms
ENO     GO:0000015^GO:0000287^GO:0004634^GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

To this in R?:

Gene    GO_terms
ENO    GO:0000015
ENO    GO:0000287
ENO    GO:0004634
ENO    GO:0006096
CCYL1    GO:0000079
SAP30    GO:0000118
SAP30    GO:0003677
SAP30    GO:0004407
SAP30    GO:0046872
SAP30    GO:0006351

Thanks from a Birgitte that cannot figure it out :-)

R • 3.7k views

ADD COMMENT • link updated 7.8 years ago by st.ph.n ★ 2.7k • written 7.9 years ago by BioBing ▴ 150

1

Entering edit mode

What's the reason for using R? Just curious.

ADD REPLY • link 7.9 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Because I have to use the "transformed" data in R, so I thought there had to be a smart way to do this. I was not aware that Python/awk etc. is better for text formatting until now. I am still pretty new to all of this "programming stuff" - learning every day :-)

ADD REPLY • link 7.8 years ago by BioBing ▴ 150

1

Entering edit mode

If you would like to see a Python option, I can provide one.

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

If it is not too much trouble, I would love to see an example :-) Thank you!

ADD REPLY • link 7.8 years ago by BioBing ▴ 150

1

Entering edit mode

You should not use R for such text formatting. Its better to use scripting language such as Python or database like psql.

ADD REPLY • link 7.9 years ago by Renesh ★ 2.2k

2

Entering edit mode

7.8 years ago

st.ph.n ★ 2.7k

Python solution.

#!/usr/bin/env python

import sys

# Open input file
with open(sys.argv[1], 'r') as f:
    # Empty dictionary
    go = {}
    # grab header
    header = next(f).split('\t')
    # for each line in file
    for line in f:
        # Dictionary {key: value}, {gene: ['go1', 'go2'..]}
        go[line.strip().split('\t')[0]] = line.strip().split('^')[1]

# reprint header
print '\t'.join(header)
# for each gene in dictionary
for i in go:
    # for each go in list by gene key, print gene\tgo
    for n in range(len(go[i])):
        print i, '\t', go[i][n]

Save as get_GOs.py, run as python get_GOs.py input.txt > output.txt

ADD COMMENT • link 7.8 years ago by st.ph.n ★ 2.7k

1

Entering edit mode

7.9 years ago

cpad0112 21k

test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")

input:

> test
       Gene                                               GO_terms
    1   ENO            GO:0000015^GO:0000287^GO:0004634^GO:0006096
    2 CCYL1                                             GO:0000079
    3 SAP30 GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

code:

> library(dplyr)
> ddply(test, .(Gene),  function(x) data.frame(GO_terms=str_split(x$GO_terms, "\\^")[[1]]))

or

> ddply(test, .(Gene),transform, test = str_split(GO_terms, "\\^")[[1]])[,c(1,3)]

Adapted from: https://stackoverflow.com/questions/12629287/melt-a-table-data-frame-based-on-values-of-comma-separated-character-vector-co

output:

    Gene   GO_terms
1  CCYL1 GO:0000079
2    ENO GO:0000015
3    ENO GO:0000287
4    ENO GO:0004634
5    ENO GO:0006096
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351

ADD COMMENT • link 7.9 years ago by cpad0112 21k

1

Entering edit mode

Another easy solution:

library(splitstackshape)
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")
test$GO_terms=gsub('\\^',";", test$GO_terms)
test2=cSplit(test, "GO_terms", ";", "long")

output:

> cSplit(test, "GO_terms", ";", "long")
     Gene   GO_terms
 1:   ENO GO:0000015
 2:   ENO GO:0000287
 3:   ENO GO:0004634
 4:   ENO GO:0006096
 5: CCYL1 GO:0000079
 6: SAP30 GO:0000118
 7: SAP30 GO:0003677
 8: SAP30 GO:0004407
 9: SAP30 GO:0046872
10: SAP30 GO:0006351

ADD REPLY • link 7.9 years ago by cpad0112 21k

1

Entering edit mode

7.9 years ago

Alex Reynolds 36k

R isn't great for text processing. Use a better tool, like awk. Here's a simple awk script that will process your file the way you want:

$ awk '{ n = split($2, a, "^"); for (i=1; i <= n; i++) { printf("%s\t%s\n", $1, a[i]); } }' go.txt
Gene    GO_terms
ENO GO:0000015
ENO GO:0000287
ENO GO:0004634
ENO GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118
SAP30   GO:0003677
SAP30   GO:0004407
SAP30   GO:0046872
SAP30   GO:0006351

ADD COMMENT • link 7.9 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you, I am new to all this "programming stuff" - learning every day :-) I tried your code, but I cannot get it to work for some reason - it returns a list of the "Gene" names without the GO_terms in the terminal, but the txt file still looks the same.

But I will definitely read up on awk and how to use it when I get the time. For this round I use Kevin Blighe's suggestion

ADD REPLY • link 7.8 years ago by BioBing ▴ 150

score 2 · Accepted Answer · 2017-09-12

2

Entering edit mode

7.9 years ago

Kevin Blighe 89k

This appears to work, assuming your data is in the data-frame 'df':

#Create a new empty data-frame
dfNew <- data.frame()

#Loop through each row in your current data-frame
for (i in 1:nrow(df))
{
    #Break up the elements in column #2 by the carat symbol ('^'),
    #and convert into a 1-column data-frame
    elements <- t(do.call(rbind, strsplit(as.character(df[i,2]), "^", TRUE)))

    #Determine how many elements were produced
    iNumElements <- nrow(elements)

    #Repeat the gene name by the number of elements relating to each
    strGeneNames <- rep(df[i,1], iNumElements)

    #Bind the new rows to the new data-frame
    dfNew <- rbind(dfNew, data.frame(strGeneNames, elements))
}

colnames(dfNew) <- c("Gene", "GO_terms")

dfNew

    Gene   GO_terms
1    ENO GO:0000015
2    ENO GO:0000287
3    ENO GO:0004634
4    ENO GO:0006096
5  CCYL1 GO:0000079
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351

ADD COMMENT • link 7.9 years ago by Kevin Blighe 89k

0

Entering edit mode

It works :-) It is slow, but it works! Thank you so much

ADD REPLY • link 7.8 years ago by BioBing ▴ 150

0

Entering edit mode

Yes, if you have a large amount of text data, it will be slow. Like the other people in the thread are saying, R is not great for processing a large amount of text data. Python is the King at that, and the awk program in linux is also really great.

ADD REPLY • link 7.8 years ago by Kevin Blighe 89k