Question

Reformatting data for Gene Ontology(GO) analysis using R

1

Entering edit mode

2.2 years ago

Hana ▴ 10

I didn't post the full question here, the posting guide is broken. If anyone has any ideas, I'd be very grateful!

https://stackoverflow.com/questions/70934464/reformatting-data-for-gene-ontologygo-analysis-using-r

r GO • 1.4k views

ADD COMMENT • link updated 2.2 years ago by cpad0112 21k • written 2.2 years ago by Hana ▴ 10

0

Entering edit mode

you don't need to load the data in R just for that. Try this:

$ datamash -Hsg1,2 collapse 3 < test.txt 

GroupBy(A)  GroupBy(B)  collapse(C)
111 1   5,7
111 2   6
222 1   9
222 2   8,10

if you want it in R, try one of these:

> sqldf("select A, B, group_concat(C) C from df group by A, B", method = "raw")

> aggregate(C ~., df, toString)

> df %>%
    group_by(A,B) %>%
    summarize(C = paste(C, collapse = ','))

input:

$ cat test.txt 
A   B   C
111 1   5
111 2   6
111 1   7
222 2   8
222 1   9
222 2   10

ADD REPLY • link 2.2 years ago by cpad0112 21k

score 0 · Answer 1 · 2022-02-02

My tmp.csv is just a file that matches your old format.

library(pacman)
p_load(tidyverse)
tmp <- read.csv('tmp.csv') %>%
    tibble() %>%
    separate(col = V1, into = c('id', 'term'), sep = ' ;') 

tmp %>%
    group_by(id) %>% 
    nest() %>%
    ungroup() %>%
    mutate(data = map(data, unlist),
           data = map(data, paste0, collapse = ' ;')) %>% 
    unnest()

Good luck!

score 0 · Answer 2 · 2022-02-02

0

Entering edit mode

2.2 years ago

Dunois ★ 2.5k

Are you just trying to collect GO terms associated with each ENSIPUG ID into a single row?

#Your data
df <- read.table(text = "ENSIPUG00000001371 ;GO:0008236
ENSIPUG00000001371 ;GO:0008233
ENSIPUG00000001371 ;GO:0070011
ENSIPUG00000001371 ;GO:0016787
ENSIPUG00000001371 ;GO:0017171
ENSIPUG00000001371 ;GO:0140096
ENSIPUG00000001374 ;GO:0005515
ENSIPUG00000001374 ;GO:0003674
ENSIPUG00000001374 ;GO:0005488
ENSIPUG00000001375 ;GO:0008152
ENSIPUG00000001375 ;GO:0008150
ENSIPUG00000001375 ;GO:0016758", sep = ",")


#Libraries we need.
library(magrittr)
library(stringr)
library(dplyr)

#Putting the ENSI identifiers and GO terms in their
#own columns.
df %<>% 
  rowwise() %>%
  mutate(V2 = unlist(str_split(V1, " ;"))[2],
         V1 = unlist(str_split(V1, " ;"))[1])

#Collecting each ENSI identifier's set of GO terms into
#a single row.
df %<>%
  group_by(V1) %>%
  mutate(V2 = paste0(V2, collapse = "; ")) %>%
  ungroup() %>%
  distinct(V1, .keep_all = TRUE)

#Taking a glance at the output.
df
# # A tibble: 3 × 2
#   V1                 V2                                                                    
#   <chr>              <chr>                                                                 
# 1 ENSIPUG00000001371 GO:0008236; GO:0008233; GO:0070011; GO:0016787; GO:0017171; GO:0140096
# 2 ENSIPUG00000001374 GO:0005515; GO:0003674; GO:0005488                                    
# 3 ENSIPUG00000001375 GO:0008152; GO:0008150; GO:0016758   

#Write to file.
#File's saved as "newformat.csv" to whatever location getwd()
#indicates.
write.table(df, file = "newformat.csv", quote = TRUE, sep = ",")

ADD COMMENT • link 2.2 years ago by Dunois ★ 2.5k

0

Entering edit mode

I tried this, thank you for your work! However, in the new .csv all it will return for V2 is NA in every column in the same amount that the GO terms should be. Do you know any reason that it wouldn't display the GO terms?

Below is a sample of the output I'm getting.

"V1","V2"
"1","ENSIPUG00000000007","NA; NA; NA; NA; NA"
"2","ENSIPUG00000000011","NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA"

ADD REPLY • link 2.2 years ago by Hana ▴ 10

0

Entering edit mode

I guess this is happening because your data isn't formatted as I thought it would be.

Could you please share the results of head(df) with us here? (Where df is the name of your R object containing this data.)

ADD REPLY • link 2.2 years ago by Dunois ★ 2.5k

0

Entering edit mode

Here is the head function from the end of what happened when I ran your solution.

 head(df)
# A tibble: 6 × 2
  V1                 V2                                                                         
  <chr>              <chr>                                                                      
1 ENSIPUG00000000007 NA; NA; NA; NA; NA                                                         
2 ENSIPUG00000000011 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
3 ENSIPUG00000000017 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
4 ENSIPUG00000000020 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
5 ENSIPUG00000000023 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
6 ENSIPUG00000000024 NA; NA; NA; NA; NA; NA

Here is what the first few lines of the input file are:

ENSIPUG00000000007  ;GO:0055114
ENSIPUG00000000007  ;GO:0008152
ENSIPUG00000000007  ;GO:0008150
ENSIPUG00000000007  ;GO:0016020
ENSIPUG00000000007  ;GO:0005575
ENSIPUG00000000011  ;GO:0006120
ENSIPUG00000000011  ;GO:0055114

Thank you again, I appreciate you following up with me!

ADD REPLY • link 2.2 years ago by Hana ▴ 10

0

Entering edit mode

Oh I meant head() for the object containing the input file's data. I wanted to check if ENSIPUG00000000007 ;GO:0055114 is being imported as a single column into R (I am assuming you're importing it into a data.frame).

I'm guessing the whitespace between ENSIPUG00000000007 ;GO:0055114 is a tab character (\t)? If it is, can you please replace

#Putting the ENSI identifiers and GO terms in their
#own columns.
df %<>% 
  rowwise() %>%
  mutate(V2 = unlist(str_split(V1, " ;"))[2],
         V1 = unlist(str_split(V1, " ;"))[1])

With this

#Putting the ENSI identifiers and GO terms in their
#own columns.
df %<>% 
  rowwise() %>%
  mutate(V2 = unlist(str_split(V1, "[\\s\\t]+;"))[2],
         V1 = unlist(str_split(V1, "[\\s\\t]+;"))[1])

And try running the script again?

ADD REPLY • link 2.2 years ago by Dunois ★ 2.5k

score 0 · Answer 3 · 2022-02-02

Using data.table:

library(data.table)

# example data
d <- fread(text = "ENSIPUG00000001371 ;GO:0008236
ENSIPUG00000001371 ;GO:0008233
ENSIPUG00000001371 ;GO:0070011
ENSIPUG00000001371 ;GO:0016787
ENSIPUG00000001371 ;GO:0017171
ENSIPUG00000001371 ;GO:0140096
ENSIPUG00000001374 ;GO:0005515
ENSIPUG00000001374 ;GO:0003674
ENSIPUG00000001374 ;GO:0005488
ENSIPUG00000001375 ;GO:0008152
ENSIPUG00000001375 ;GO:0008150
ENSIPUG00000001375 ;GO:0016758", header = FALSE)

d[, paste(V2, collapse = "; "), by = V1]
#                    V1                                                                     V1
# 1: ENSIPUG00000001371 GO:0008236; GO:0008233; GO:0070011; GO:0016787; GO:0017171; GO:0140096
# 2: ENSIPUG00000001374                                     GO:0005515; GO:0003674; GO:0005488
# 3: ENSIPUG00000001375                                     GO:0008152; GO:0008150; GO:0016758