How to split columns into rows based on gene ids in R?
4
0
Entering edit mode
7.2 years ago
BioBing ▴ 150

Hi all,

Does any of you cool R-sharks know how to transform data from this:

Gene    GO_terms
ENO     GO:0000015^GO:0000287^GO:0004634^GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

To this in R?:

Gene    GO_terms
ENO    GO:0000015
ENO    GO:0000287
ENO    GO:0004634
ENO    GO:0006096
CCYL1    GO:0000079
SAP30    GO:0000118
SAP30    GO:0003677
SAP30    GO:0004407
SAP30    GO:0046872
SAP30    GO:0006351

Thanks from a Birgitte that cannot figure it out :-)

R • 2.8k views
ADD COMMENT
1
Entering edit mode

What's the reason for using R? Just curious.

ADD REPLY
0
Entering edit mode

Because I have to use the "transformed" data in R, so I thought there had to be a smart way to do this. I was not aware that Python/awk etc. is better for text formatting until now. I am still pretty new to all of this "programming stuff" - learning every day :-)

ADD REPLY
1
Entering edit mode

If you would like to see a Python option, I can provide one.

ADD REPLY
0
Entering edit mode

If it is not too much trouble, I would love to see an example :-) Thank you!

ADD REPLY
1
Entering edit mode

You should not use R for such text formatting. Its better to use scripting language such as Python or database like psql.

ADD REPLY
2
Entering edit mode
7.2 years ago

This appears to work, assuming your data is in the data-frame 'df':

#Create a new empty data-frame
dfNew <- data.frame()

#Loop through each row in your current data-frame
for (i in 1:nrow(df))
{
    #Break up the elements in column #2 by the carat symbol ('^'),
    #and convert into a 1-column data-frame
    elements <- t(do.call(rbind, strsplit(as.character(df[i,2]), "^", TRUE)))

    #Determine how many elements were produced
    iNumElements <- nrow(elements)

    #Repeat the gene name by the number of elements relating to each
    strGeneNames <- rep(df[i,1], iNumElements)

    #Bind the new rows to the new data-frame
    dfNew <- rbind(dfNew, data.frame(strGeneNames, elements))
}

colnames(dfNew) <- c("Gene", "GO_terms")

dfNew

    Gene   GO_terms
1    ENO GO:0000015
2    ENO GO:0000287
3    ENO GO:0004634
4    ENO GO:0006096
5  CCYL1 GO:0000079
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351
ADD COMMENT
0
Entering edit mode

It works :-) It is slow, but it works! Thank you so much

ADD REPLY
0
Entering edit mode

Yes, if you have a large amount of text data, it will be slow. Like the other people in the thread are saying, R is not great for processing a large amount of text data. Python is the King at that, and the awk program in linux is also really great.

ADD REPLY
2
Entering edit mode
7.2 years ago
st.ph.n ★ 2.7k

Python solution.

#!/usr/bin/env python

import sys

# Open input file
with open(sys.argv[1], 'r') as f:
    # Empty dictionary
    go = {}
    # grab header
    header = next(f).split('\t')
    # for each line in file
    for line in f:
        # Dictionary {key: value}, {gene: ['go1', 'go2'..]}
        go[line.strip().split('\t')[0]] = line.strip().split('^')[1]

# reprint header
print '\t'.join(header)
# for each gene in dictionary
for i in go:
    # for each go in list by gene key, print gene\tgo
    for n in range(len(go[i])):
        print i, '\t', go[i][n]

Save as get_GOs.py, run as python get_GOs.py input.txt > output.txt

ADD COMMENT
1
Entering edit mode
7.2 years ago
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")

input:

> test
       Gene                                               GO_terms
    1   ENO            GO:0000015^GO:0000287^GO:0004634^GO:0006096
    2 CCYL1                                             GO:0000079
    3 SAP30 GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

code:

> library(dplyr)
> ddply(test, .(Gene),  function(x) data.frame(GO_terms=str_split(x$GO_terms, "\\^")[[1]]))

or

> ddply(test, .(Gene),transform, test = str_split(GO_terms, "\\^")[[1]])[,c(1,3)]

Adapted from: https://stackoverflow.com/questions/12629287/melt-a-table-data-frame-based-on-values-of-comma-separated-character-vector-co

output:

    Gene   GO_terms
1  CCYL1 GO:0000079
2    ENO GO:0000015
3    ENO GO:0000287
4    ENO GO:0004634
5    ENO GO:0006096
6  SAP30 GO:0000118
7  SAP30 GO:0003677
8  SAP30 GO:0004407
9  SAP30 GO:0046872
10 SAP30 GO:0006351
ADD COMMENT
1
Entering edit mode

Another easy solution:

library(splitstackshape)
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")
test$GO_terms=gsub('\\^',";", test$GO_terms)
test2=cSplit(test, "GO_terms", ";", "long")

output:

> cSplit(test, "GO_terms", ";", "long")
     Gene   GO_terms
 1:   ENO GO:0000015
 2:   ENO GO:0000287
 3:   ENO GO:0004634
 4:   ENO GO:0006096
 5: CCYL1 GO:0000079
 6: SAP30 GO:0000118
 7: SAP30 GO:0003677
 8: SAP30 GO:0004407
 9: SAP30 GO:0046872
10: SAP30 GO:0006351
ADD REPLY
1
Entering edit mode
7.2 years ago

R isn't great for text processing. Use a better tool, like awk. Here's a simple awk script that will process your file the way you want:

$ awk '{ n = split($2, a, "^"); for (i=1; i <= n; i++) { printf("%s\t%s\n", $1, a[i]); } }' go.txt
Gene    GO_terms
ENO GO:0000015
ENO GO:0000287
ENO GO:0004634
ENO GO:0006096
CCYL1   GO:0000079
SAP30   GO:0000118
SAP30   GO:0003677
SAP30   GO:0004407
SAP30   GO:0046872
SAP30   GO:0006351
ADD COMMENT
0
Entering edit mode

Thank you, I am new to all this "programming stuff" - learning every day :-) I tried your code, but I cannot get it to work for some reason - it returns a list of the "Gene" names without the GO_terms in the terminal, but the txt file still looks the same.

But I will definitely read up on awk and how to use it when I get the time. For this round I use Kevin Blighe's suggestion

ADD REPLY

Login before adding your answer.

Traffic: 839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6