Question

Retrieving GO mappings for GOseq analysis with non-native organism

3

Entering edit mode

3.9 years ago

thomas.welch ▴ 50

Hi there,

I'm currently conducting a transcriptomics study on Nicotiana benthamiana, and have reached a point where i would like to conduct a GO enrichment analysis using GOseq or topGO. Of course neither of these have N. benthamiana supported natively, so i would have to provide the GO mappings N. benthamiana genes myself. Other studies using the same sequence and annotations have used GOseq and topGO, but do not go into detail in their methods sections how they did this.

Of course i could simply feed a fasta of the DEGs into blast2go, but i don't actually have blast2go, and i feel like i'm missing something.

I cannot find a .GAF file (ontology annotation) in the annotations section of the N. benthamiana sequence website (ftp://ftp.solgenomics.net/genomes/Nicotiana_benthamiana/annotation/Niben101/). But i can see the GO terms for each gene in the .gff file, and they also provide a human readable functional annotations file (in .txt format) which contains GO terms (see picture at bottom). I could write a simple script to strip this .txt file down to a list of genes and GO terms, then use the method outlined here (https://bioinformatics.stackexchange.com/questions/568/how-can-i-create-my-own-go-association-file-gaf) to create a .GAF file, but i'm convinced there's an easier way.

Is there some way i can derive the GO mapping i need for GOseq from what is already provided?

enter image description here

gene ontology GOseq topGO GAF gff • 2.8k views

ADD COMMENT • link 3.9 years ago by thomas.welch ▴ 50

i.sudbery · Answer 1 · 2020-05-28

2

Entering edit mode

3.9 years ago

i.sudbery 19k

GOseq takes its gene to category associations as a named listed of vectors, where each vector is the terms associated with each gene, as such, a GAF file isn't neccessarily the end of the solution anyway.

I'd create this object from the annotation txt file.

Something like:

library(tidyr)
annoations <- read.delim("annotations.txt")
annotations %>%
  select(Protein.Accession, Gene.Ontolog.ID..Name.) %>%
  seperate_rowsGene.Ontology.ID..Name., sep=",") -> annotations

gene2cat <- split(annoations$Gene.Ontology.ID..Name, annotations$Protein.Accession)

gene2cat can then be passed to the goseq function. Note that to run goseq properly, you'll also need a bias.data vector with the length of each gene.

ADD COMMENT • link 3.9 years ago by i.sudbery 19k

0

Entering edit mode

Thank you for the help, I'm going through the same problem. What I can't understand is how the gene2cat file should look. In my case, I have a table with the gene ids and the GO terms divided by category (GO_Biological, Go_Cellular, etc) (see picture at bottom). I have done a try using GO_Biological with your tip:

annotation1000 <- read_excel("annotation1000.xls")
annotation1000 %>%
  select(GeneID, GO_Biological) %>%
 separate_rows(GO_Biological, sep=",") -> annotation1000
gene2cat <- split(annotation1000$GO_Biological, annotation1000$Query_Sequence)

The output looks good (each gene with its bunch of GO terms). However, when I run the following:

 go.wall <- goseq(pwf.3, gene2cat = gene2cat, test.cats = c("GO:BP"), method="Wallenius")

or

  go.wall <- goseq(pwf.3, gene2cat = gene2cat, method="Wallenius")

I obtain this error:

 Error in .testForValidKeys(x, keys, keytype, fks) : 
   None of the keys entered are valid keys for 'GOID'. Please use the keys method to see a listing of valid arguments.

I don't know how to prepare the gene2cat object properly, in addition, I would like to use all the GO categories at the same time. Do you have any suggestions on how to proceed? Thank you!!

enter image description here

ADD REPLY • link updated 3.9 years ago by i.sudbery 19k • written 3.9 years ago by Cristina Zamora • 0

1

Entering edit mode

If you are providing your own categories, then the test.cats parameter doesn't apply. All your categories are BP, so you don't need to tell it to run only those categories.

Also, why annotation1000$Query_sequence rather than annotation1000$GeneID?

ADD REPLY • link 3.9 years ago by i.sudbery 19k

0

Entering edit mode

Thank you. Finally I had to modify all the document with BASH and R forcing it to look like "GENEID" "ONTOLOGY" "GOID" "TERM" PITA_XXXXX CC GO:XXXXX cellular_component Individual GO terms in each row for all the genes. By the way, annotation1000$Query_sequence was my mistake.

ADD REPLY • link 3.9 years ago by Cristina Zamora • 0

score 0 · Answer 2 · 2020-05-28

Thank you very much, this is very helpful.

This seems to work almost perfectly, the functions in brackets next to the GO terms in the functional annotations file are un-helpfully worded, so i will have to use a few lines of gsub to remove bits of leftover text, e.g.

annotations$Gene.Ontology = gsub("\\s*\\([^\\)]+\\)","",as.character(annotations$Gene.Ontology.ID..Name.))
annotations$Gene.Ontology = gsub(" glycohydrolase activity)","",as.character(annotations$Gene.Ontology))
annotations$Gene.Ontology = gsub("+ mRNA export from nucleus)","",as.character(annotations$Gene.Ontology))

so that i have only the geneIDs and GO terms remaining, mapped in many-to-many format, as required by GOseq.