Question: How to split columns into rows based on gene ids in R?
0
gravatar for BioBing
9 months ago by
BioBing90
Denmark
BioBing90 wrote:

Hi all,

Does any of you cool R-sharks know how to transform data from this:

Gene    GO_terms
ENO     GO:0000015^GO:0000287^GO:0004634^GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

To this in R?:

Gene    GO_terms
ENO    GO:0000015
ENO    GO:0000287
ENO    GO:0004634
ENO    GO:0006096
CCYL1    GO:0000079
SAP30    GO:0000118
SAP30    GO:0003677
SAP30    GO:0004407
SAP30    GO:0046872
SAP30    GO:0006351

Thanks from a Birgitte that cannot figure it out :-)

R • 537 views
ADD COMMENTlink modified 9 months ago by st.ph.n2.2k • written 9 months ago by BioBing90
1

What's the reason for using R? Just curious.

ADD REPLYlink written 9 months ago by st.ph.n2.2k

Because I have to use the "transformed" data in R, so I thought there had to be a smart way to do this. I was not aware that Python/awk etc. is better for text formatting until now. I am still pretty new to all of this "programming stuff" - learning every day :-)

ADD REPLYlink written 9 months ago by BioBing90
1

If you would like to see a Python option, I can provide one.

ADD REPLYlink written 9 months ago by st.ph.n2.2k

If it is not too much trouble, I would love to see an example :-) Thank you!

ADD REPLYlink written 9 months ago by BioBing90
1

You should not use R for such text formatting. Its better to use scripting language such as Python or database like psql.

ADD REPLYlink written 9 months ago by Renesh1.2k
2
gravatar for Kevin Blighe
9 months ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

This appears to work, assuming your data is in the data-frame 'df':

#Create a new empty data-frame
dfNew <- data.frame()

#Loop through each row in your current data-frame
for (i in 1:nrow(df))
{
    #Break up the elements in column #2 by the carat symbol ('^'),
    #and convert into a 1-column data-frame
    elements <- t(do.call(rbind, strsplit(as.character(df[i,2]), "^", TRUE)))

    #Determine how many elements were produced
    iNumElements <- nrow(elements)

    #Repeat the gene name by the number of elements relating to each
    strGeneNames <- rep(df[i,1], iNumElements)

    #Bind the new rows to the new data-frame
    dfNew <- rbind(dfNew, data.frame(strGeneNames, elements))
}

colnames(dfNew) <- c("Gene", "GO_terms")

dfNew

    Gene   GO_terms
1    ENO GO:0000015
2    ENO GO:0000287
3    ENO GO:0004634
4    ENO GO:0006096
5  CCYL1 GO:0000079
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351
ADD COMMENTlink written 9 months ago by Kevin Blighe21k

It works :-) It is slow, but it works! Thank you so much

ADD REPLYlink written 9 months ago by BioBing90

Yes, if you have a large amount of text data, it will be slow. Like the other people in the thread are saying, R is not great for processing a large amount of text data. Python is the King at that, and the awk program in linux is also really great.

ADD REPLYlink written 9 months ago by Kevin Blighe21k
2
gravatar for st.ph.n
9 months ago by
st.ph.n2.2k
Philadelphia, PA
st.ph.n2.2k wrote:

Python solution.

#!/usr/bin/env python

import sys

# Open input file
with open(sys.argv[1], 'r') as f:
    # Empty dictionary
    go = {}
    # grab header
    header = next(f).split('\t')
    # for each line in file
    for line in f:
        # Dictionary {key: value}, {gene: ['go1', 'go2'..]}
        go[line.strip().split('\t')[0]] = line.strip().split('^')[1]

# reprint header
print '\t'.join(header)
# for each gene in dictionary
for i in go:
    # for each go in list by gene key, print gene\tgo
    for n in range(len(go[i])):
        print i, '\t', go[i][n]

Save as get_GOs.py, run as python get_GOs.py input.txt > output.txt

ADD COMMENTlink written 9 months ago by st.ph.n2.2k
1
gravatar for cpad0112
9 months ago by
cpad01126.3k
cpad01126.3k wrote:
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")

input:

> test
       Gene                                               GO_terms
    1   ENO            GO:0000015^GO:0000287^GO:0004634^GO:0006096
    2 CCYL1                                             GO:0000079
    3 SAP30 GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

code:

> library(dplyr)
> ddply(test, .(Gene),  function(x) data.frame(GO_terms=str_split(x$GO_terms, "\\^")[[1]]))

or

> ddply(test, .(Gene),transform, test = str_split(GO_terms, "\\^")[[1]])[,c(1,3)]

Adapted from: https://stackoverflow.com/questions/12629287/melt-a-table-data-frame-based-on-values-of-comma-separated-character-vector-co

output:

    Gene   GO_terms
1  CCYL1 GO:0000079
2    ENO GO:0000015
3    ENO GO:0000287
4    ENO GO:0004634
5    ENO GO:0006096
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351
ADD COMMENTlink modified 9 months ago • written 9 months ago by cpad01126.3k
1

Another easy solution:

library(splitstackshape)
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")
test$GO_terms=gsub('\\^',";", test$GO_terms)
test2=cSplit(test, "GO_terms", ";", "long")

output:

> cSplit(test, "GO_terms", ";", "long")
     Gene   GO_terms
 1:   ENO GO:0000015
 2:   ENO GO:0000287
 3:   ENO GO:0004634
 4:   ENO GO:0006096
 5: CCYL1 GO:0000079
 6: SAP30 GO:0000118
 7: SAP30 GO:0003677
 8: SAP30 GO:0004407
 9: SAP30 GO:0046872
10: SAP30 GO:0006351
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad01126.3k
1
gravatar for Alex Reynolds
9 months ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

R isn't great for text processing. Use a better tool, like awk. Here's a simple awk script that will process your file the way you want:

$ awk '{ n = split($2, a, "^"); for (i=1; i <= n; i++) { printf("%s\t%s\n", $1, a[i]); } }' go.txt
Gene    GO_terms
ENO GO:0000015
ENO GO:0000287
ENO GO:0004634
ENO GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118
SAP30   GO:0003677
SAP30   GO:0004407
SAP30   GO:0046872
SAP30   GO:0006351
ADD COMMENTlink written 9 months ago by Alex Reynolds24k

Thank you, I am new to all this "programming stuff" - learning every day :-) I tried your code, but I cannot get it to work for some reason - it returns a list of the "Gene" names without the GO_terms in the terminal, but the txt file still looks the same.

But I will definitely read up on awk and how to use it when I get the time. For this round I use Kevin Blighe's suggestion

ADD REPLYlink written 9 months ago by BioBing90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 910 users visited in the last hour