Question: How to split columns into rows based on gene ids in R?
BioBing wrote:

Hi all,

Do any of you cool R-sharks know how to transform data from this:

Gene    GO_terms
ENO     GO:0000015^GO:0000287^GO:0004634^GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

To this in R?:

Gene    GO_terms
ENO    GO:0000015
ENO    GO:0000287
ENO    GO:0004634
ENO    GO:0006096
CCYL1    GO:0000079
SAP30    GO:0000118
SAP30    GO:0003677
SAP30    GO:0004407
SAP30    GO:0046872
SAP30    GO:0006351

Thanks from a Birgitte that cannot figure it out :-)

modified 10 weeks ago by st.ph.n • written 10 weeks ago by BioBing

What's the reason for using R? Just curious.

written 10 weeks ago by st.ph.n

Because I have to use the "transformed" data in R, so I thought there had to be a smart way to do this. I was not aware until now that Python, awk, etc. are better for text formatting. I am still pretty new to all of this "programming stuff" - learning every day :-)

written 10 weeks ago by BioBing

If you would like to see a Python option, I can provide one.

written 10 weeks ago by st.ph.n

If it is not too much trouble, I would love to see an example :-) Thank you!

written 10 weeks ago by BioBing

You should not use R for such text formatting. It's better to use a scripting language such as Python or a database such as PostgreSQL (psql).

written 10 weeks ago by Renesh
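
For readers wondering what the scripting-language route looks like, the core of the transformation is small: split the GO_terms field on the caret and emit one gene/term pair per piece. Below is a minimal sketch that works on in-memory rows copied from the question. The function name `split_go_terms` and the variable `wide` are made up for illustration and do not come from any post in this thread.

```python
def split_go_terms(rows, sep="^"):
    """Expand (gene, 'GO:a^GO:b') pairs into one (gene, term) pair per GO term."""
    long_rows = []
    for gene, terms in rows:
        for term in terms.split(sep):
            long_rows.append((gene, term))
    return long_rows


# Sample rows copied from the question
wide = [
    ("ENO", "GO:0000015^GO:0000287^GO:0004634^GO:0006096"),
    ("CCYL1", "GO:0000079"),
    ("SAP30", "GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351"),
]

# Print the long format, one tab-separated gene/term pair per line
for gene, term in split_go_terms(wide):
    print(gene, term, sep="\t")
```

Python's `str.split` treats the separator as a literal string, so the caret needs no escaping here, unlike in a regular expression.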
Kevin Blighe wrote:

This appears to work, assuming your data is in the data-frame 'df':

#Create a new empty data-frame
dfNew <- data.frame()

#Loop through each row in your current data-frame
for (i in seq_len(nrow(df)))
{
    #Break up the elements in column #2 by the caret symbol ('^');
    #fixed = TRUE makes strsplit() treat '^' literally, not as a regex anchor
    elements <- strsplit(as.character(df[i,2]), "^", fixed = TRUE)[[1]]

    #Determine how many elements were produced
    iNumElements <- length(elements)

    #Repeat the gene name by the number of elements relating to each
    strGeneNames <- rep(as.character(df[i,1]), iNumElements)

    #Bind the new rows to the new data-frame
    dfNew <- rbind(dfNew, data.frame(strGeneNames, elements, stringsAsFactors = FALSE))
}

colnames(dfNew) <- c("Gene", "GO_terms")

dfNew

    Gene   GO_terms
1    ENO GO:0000015
2    ENO GO:0000287
3    ENO GO:0004634
4    ENO GO:0006096
5  CCYL1 GO:0000079
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351
written 10 weeks ago by Kevin Blighe

It works :-) It is slow, but it works! Thank you so much

written 10 weeks ago by BioBing

Yes, if you have a large amount of text data, it will be slow. As others in the thread have said, R is not great for processing large amounts of text. Python is king at that, and the awk program on Linux is also really good.

written 10 weeks ago by Kevin Blighe
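
Part of the slowness with large inputs tends to come from growing a table one piece at a time, because the rows accumulated so far get copied on every iteration. A streaming approach avoids this by handling one line at a time and never holding the whole table in memory. The sketch below assumes tab-separated input with a header line. The generator name `iter_long_rows` is hypothetical, and `io.StringIO` only stands in for a real file.

```python
import io


def iter_long_rows(lines, sep="^"):
    """Yield 'gene<TAB>term' strings one at a time from tab-separated input lines."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input: nothing to yield
    yield header.rstrip("\r\n")
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        gene, terms = line.split("\t", 1)
        for term in terms.split(sep):
            yield f"{gene}\t{term}"


# Stand-in for a large file; with a real file, pass the open file object instead
demo = io.StringIO(
    "Gene\tGO_terms\n"
    "ENO\tGO:0000015^GO:0000287\n"
    "CCYL1\tGO:0000079\n"
)

for row in iter_long_rows(demo):
    print(row)
```

Because the function is a generator, memory use stays roughly constant regardless of how many lines the input has.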
st.ph.n wrote:

Python solution.

#!/usr/bin/env python3

import sys

# Open input file (tab-separated: gene name, then GO terms joined by '^')
with open(sys.argv[1], 'r') as f:
    # Empty dictionary
    go = {}
    # grab header, dropping the trailing newline
    header = next(f).rstrip('\r\n').split('\t')
    # for each line in file
    for line in f:
        # skip blank lines
        if not line.strip():
            continue
        gene, terms = line.strip().split('\t')[:2]
        # Dictionary {key: value}, {gene: ['go1', 'go2'..]}
        go[gene] = terms.split('^')

# reprint header
print('\t'.join(header))
# for each gene in dictionary (insertion order is kept in Python 3.7+)
for gene in go:
    # for each go in list by gene key, print gene\tgo
    for term in go[gene]:
        print(gene, term, sep='\t')

Save as get_GOs.py, run as python3 get_GOs.py input.txt > output.txt

written 10 weeks ago by st.ph.n
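
One design point of the dictionary approach above: a plain assignment keeps a single entry per gene, so if the same gene appeared on more than one input line, only the last line's terms would survive. The sample data has no repeated genes, so this is a precaution rather than a fix for the question as asked. Below is a minimal sketch that merges repeated genes instead. The function name `collect_go_terms` and the repeated-ENO input are hypothetical.

```python
def collect_go_terms(rows, sep="^"):
    """Map each gene to its GO terms, merging repeated genes and skipping duplicate terms."""
    go = {}
    for gene, terms in rows:
        bucket = go.setdefault(gene, [])  # reuse the existing list for a repeated gene
        for term in terms.split(sep):
            if term not in bucket:
                bucket.append(term)
    return go


# Hypothetical input in which ENO appears on two lines
rows = [
    ("ENO", "GO:0000015^GO:0000287"),
    ("CCYL1", "GO:0000079"),
    ("ENO", "GO:0000287^GO:0004634"),
]

for gene, terms in collect_go_terms(rows).items():
    for term in terms:
        print(gene, term, sep="\t")
```

Genes come out in order of first appearance, and a term listed twice for the same gene is printed once.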
cpad0112 wrote:

test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")

input:

> test
       Gene                                               GO_terms
    1   ENO            GO:0000015^GO:0000287^GO:0004634^GO:0006096
    2 CCYL1                                             GO:0000079
    3 SAP30 GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

code:

> library(plyr)     # provides ddply() and .()
> library(stringr)  # provides str_split()
> ddply(test, .(Gene),  function(x) data.frame(GO_terms=str_split(x$GO_terms, "\\^")[[1]]))

or

> ddply(test, .(Gene),transform, test = str_split(GO_terms, "\\^")[[1]])[,c(1,3)]

Adapted from: https://stackoverflow.com/questions/12629287/melt-a-table-data-frame-based-on-values-of-comma-separated-character-vector-co

output:

    Gene   GO_terms
1  CCYL1 GO:0000079
2    ENO GO:0000015
3    ENO GO:0000287
4    ENO GO:0004634
5    ENO GO:0006096
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351
modified 10 weeks ago • written 10 weeks ago by cpad0112

Another easy solution:

library(splitstackshape)
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")
test$GO_terms=gsub('\\^',";", test$GO_terms)
test2=cSplit(test, "GO_terms", ";", "long")

output:

> cSplit(test, "GO_terms", ";", "long")
     Gene   GO_terms
 1:   ENO GO:0000015
 2:   ENO GO:0000287
 3:   ENO GO:0004634
 4:   ENO GO:0006096
 5: CCYL1 GO:0000079
 6: SAP30 GO:0000118
 7: SAP30 GO:0003677
 8: SAP30 GO:0004407
 9: SAP30 GO:0046872
10: SAP30 GO:0006351
modified 10 weeks ago • written 10 weeks ago by cpad0112
Alex Reynolds wrote:

R isn't great for text processing. Use a better tool, like awk. Here's a simple awk script that will process your file the way you want:

$ awk '{ n = split($2, a, "^"); for (i=1; i <= n; i++) { printf("%s\t%s\n", $1, a[i]); } }' go.txt
Gene    GO_terms
ENO GO:0000015
ENO GO:0000287
ENO GO:0004634
ENO GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118
SAP30   GO:0003677
SAP30   GO:0004407
SAP30   GO:0046872
SAP30   GO:0006351
written 10 weeks ago by Alex Reynolds

Thank you, I am new to all this "programming stuff" - learning every day :-) I tried your code, but I cannot get it to work for some reason - it returns a list of the "Gene" names without the GO_terms in the terminal, but the txt file still looks the same.

But I will definitely read up on awk and how to use it when I get the time. For this round I will use Kevin Blighe's suggestion.

written 10 weeks ago by BioBing
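
On the text file looking the same: awk prints its result to the terminal (standard output) and leaves the input file untouched, so the result has to be redirected or written to a new file to be saved. This does not explain the missing GO terms in the terminal output. The sketch below shows the write-to-a-new-file idea in Python, to match the other scripted answer in the thread. It assumes tab-separated input. The function name `write_long_table` and the output name `go_long.txt` are hypothetical, and the temporary directory is only there to keep the demonstration self-contained.

```python
import csv
import os
import tempfile


def write_long_table(in_path, out_path, sep="^"):
    """Read a tab-separated gene/GO table and write one gene/term pair per row to out_path."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(next(reader))  # copy the header unchanged
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            for term in row[1].split(sep):
                writer.writerow([row[0], term])


# Demonstration with temporary files; real paths would replace these
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "go.txt")
out_path = os.path.join(workdir, "go_long.txt")
with open(in_path, "w") as handle:
    handle.write("Gene\tGO_terms\nENO\tGO:0000015^GO:0000287\nCCYL1\tGO:0000079\n")

write_long_table(in_path, out_path)
with open(out_path) as handle:
    print(handle.read(), end="")
```

The input file is only read, never modified. The long-format table ends up in the separate output file, which can then be loaded into R with read.csv and sep="\t".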