Question

How to create a custom Gene-to-GO mapping file for TopGO

2.5 years ago

ges29 ▴ 50

Hi,

So I'm trying to do some GO term enrichment analysis for some custom annotations, using the TopGO package in R.

I'm following section 4.3 of the user guide, found here.

The data needs to be in the following format (note: the file should have two tab-delimited columns, the second of which lists the corresponding GO terms, separated by commas):

068724  GO:0005488, GO:0003774, GO:0001539, GO:0006935, GO:0009288
119608  GO:0005634, GO:0030528, GO:0006355, GO:0045449, GO:0003677, GO:0007275
049239  GO:0016787, GO:0017057, GO:0005975, GO:0005783, GO:0005792, GO:0004345, GO:0005788, GO:0047936, GO:0006098, GO:0005488, GO:0006006, GO:0055114, GO:0016491
067829  GO:0045926, GO:0016616, GO:0000287, GO:0030145, GO:0005739, GO:0000166, GO:0005575, GO:0006099, GO:0005524, GO:0008152, GO:0006102, GO:0005759, GO:0005975, GO:0004449, GO:0055114, GO:0016491
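To make the target format concrete, here is a minimal Python sketch of how one such line breaks down into a gene ID and a list of GO terms. The function name parse_map_line is hypothetical, and the example line is shortened from the first row above; this is only an illustration of the layout, not how topGO itself reads the file.

```python
def parse_map_line(line):
    """Split one 'gene<TAB>GO:..., GO:...' line into (gene_id, [go_terms])."""
    # The first tab separates the gene ID from the GO term list
    gene_id, go_field = line.rstrip("\n").split("\t", 1)
    # GO terms are comma-separated; strip the spaces that follow each comma
    go_terms = [term.strip() for term in go_field.split(",") if term.strip()]
    return gene_id, go_terms


example = "068724\tGO:0005488, GO:0003774, GO:0001539"
gene, terms = parse_map_line(example)
print(gene, terms)
# 068724 ['GO:0005488', 'GO:0003774', 'GO:0001539']
```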

However, my data currently looks like this:

QBM89824.1  GO:0072659
QBM86167.1  GO:0070072
QBM87744.1  GO:0031307
QBM87744.1  GO:0045040
QBM87744.1  GO:0070096
QBM87389.1  GO:0000500
QBM87389.1  GO:0042790
QBM85935.1  GO:0035859
QBM85935.1  GO:0050790
QBM85935.1  GO:0005096
QBM85935.1  GO:0042819
QBM85935.1  GO:0032007

I'm having trouble transforming my data into the required format. There are currently over 11k rows, so sorting it out manually isn't an option. Does anyone know of any methods for doing this? I'm comfortable using Python, but not so much with R.

Thanks in advance!

TopGO annotation mapping R transform • 1.6k views

score 2 · Answer 1 · 2021-10-18

So I've got a solution, though it's rather convoluted and likely not the most efficient!

Using a shorter example of different food type data:

> df
   Type       Food
1 Fruit      Apple
2 Fruit       Pear
3 Fruit     Banana
4 Fruit     Orange
5   Veg     Carrot
6   Veg     Potato
7   Veg    Parsnip
8 Pulse   Chickpea
9 Pulse Broad Bean

First, use pivot_wider (from tidyr). The line below incorporates an answer from here. It's necessary to create a unique row identifier within each Type; otherwise the output contains list-columns.

> library(dplyr)
> df_pivot <- df %>% group_by(Type) %>% mutate(row = row_number()) %>% tidyr::pivot_wider(names_from = Type, values_from = Food) %>% select(-row)

This should produce a tibble with NAs where the gaps are.

> df_pivot
# A tibble: 4 × 3
  Fruit  Veg     Pulse     
  <chr>  <chr>   <chr>     
1 Apple  Carrot  Chickpea  
2 Pear   Potato  Broad Bean
3 Banana Parsnip NA        
4 Orange NA      NA

Next, this needs to be transposed so the columns become rows again:

> transposed_df_pivot <- as.data.frame(t(as.matrix(df_pivot)))
> transposed_df_pivot
            V1         V2      V3     V4
Fruit    Apple       Pear  Banana Orange
Veg     Carrot     Potato Parsnip   <NA>
Pulse Chickpea Broad Bean    <NA>   <NA>
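For anyone who would rather stay in Python, the pivot and transpose steps above can be sketched in pandas in one go. This is a minimal sketch and not part of the R workflow shown here: it rebuilds the food example by hand, and the helper column name row is simply borrowed from the R line. groupby().cumcount() plays the role of row_number(), and pivoting with Type as the index gives the already-transposed layout (one row per Type), with the rows sorted alphabetically by Type.

```python
import pandas as pd

# Rebuild the food example as a long (two-column) table
df = pd.DataFrame({
    "Type": ["Fruit"] * 4 + ["Veg"] * 3 + ["Pulse"] * 2,
    "Food": ["Apple", "Pear", "Banana", "Orange",
             "Carrot", "Potato", "Parsnip",
             "Chickpea", "Broad Bean"],
})

# Number the rows within each Type (0, 1, 2, ...), like row_number() in dplyr
df["row"] = df.groupby("Type").cumcount()

# One row per Type, one column per position; missing positions become NaN
wide = df.pivot(index="Type", columns="row", values="Food")
print(wide)
```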

For those interested in finalising the formatting ready for TopGO, I then exported transposed_df_pivot as a CSV, removed the column names in Excel, and did the rest in Python, as below:

>>> import pandas as pd

# load data with no headers
>>> df = pd.read_csv('your-exported-R-dataframe.csv', header=None)

# Create a new column 'combined' that joins the values from every column after the first in each row, separated by commas and skipping NaNs
>>> df["combined"] = df[df.columns[1:]].apply(lambda x: ', '.join(x[x.notnull()]), axis=1)

# Combine the columns you need into a new df and export as a tab-delimited file with .map file extension
>>> df2 = df[[0, 'combined']]
>>> df2.to_csv('Your_file_name.map', sep = '\t', index=False, header=False)
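A more direct route, for anyone comfortable with Python, is to skip the pivot entirely and collapse the two-column file in one pass. The sketch below uses only the standard library and assumes the input has exactly two whitespace-separated fields per row (gene ID, then a single GO term), as in the question. The function name long_to_topgo_map is hypothetical, and the sample rows are copied from the question.

```python
from collections import defaultdict
import io


def long_to_topgo_map(lines):
    """Collapse 'gene<whitespace>GO term' rows into one line per gene."""
    gene_to_terms = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        gene, term = fields
        # Keep each GO term once per gene, in order of first appearance
        if term not in gene_to_terms[gene]:
            gene_to_terms[gene].append(term)
    # Gene ID, a tab, then the GO terms joined by ', '
    return [f"{gene}\t{', '.join(terms)}" for gene, terms in gene_to_terms.items()]


# A few rows from the question, standing in for the real file
sample = io.StringIO(
    "QBM89824.1\tGO:0072659\n"
    "QBM87744.1\tGO:0031307\n"
    "QBM87744.1\tGO:0045040\n"
    "QBM87744.1\tGO:0070096\n"
)
for row in long_to_topgo_map(sample):
    print(row)
# QBM89824.1	GO:0072659
# QBM87744.1	GO:0031307, GO:0045040, GO:0070096
```

For the real data, an open file handle can be passed in place of sample, and the returned lines written out with a newline after each to a .map file.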