Question

GOseq gene2cat format errors


4.6 years ago

et395 • 0

Hi there Biostars, I'm working on a GO enrichment analysis for some wheat RNA-seq data. I'd like to use the package GOseq for this and have been following the vignette. The package requires three data sets. First, a vector of all the genes in your transcriptome, with '1' denoting DE genes and '0' denoting non-DE genes. Second, a vector with the length of each gene. Third, either a data frame with two columns, gene and GO term (each gene can have multiple GO terms, so gene names repeat across rows), or a list of lists where the name of each list is a gene name and its contents are that gene's GO terms.
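For readers following along, here is a minimal sketch of what those three inputs can look like in base R. The gene IDs, lengths and GO terms are made up for illustration, and the object names (de_genes, gene_lengths, go_df, go_list) are hypothetical, not taken from the vignette.

```r
# Toy data only: gene IDs, lengths and GO terms below are invented.
genes <- c("geneA", "geneB", "geneC", "geneD")

# 1) Named 0/1 vector: 1 = differentially expressed, 0 = not.
de_genes <- c(1, 0, 1, 0)
names(de_genes) <- genes

# 2) Gene lengths, in the same order as de_genes.
gene_lengths <- c(1200, 800, 2300, 1500)
names(gene_lengths) <- genes

# 3a) Two-column data frame: one row per gene/GO-term pair,
#     so a gene with several GO terms appears on several rows.
go_df <- data.frame(
  gene_id = c("geneA", "geneA", "geneB", "geneC"),
  GO      = c("GO:0000001", "GO:0000002", "GO:0000001", "GO:0000003"),
  stringsAsFactors = FALSE
)

# 3b) The same mapping as a named list: names are gene IDs,
#     each entry is a character vector of that gene's GO terms.
go_list <- list(
  geneA = c("GO:0000001", "GO:0000002"),
  geneB = "GO:0000001",
  geneC = "GO:0000003"
)
```

A gene with no annotation (geneD here) is simply absent from the mapping.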

I had no problem fitting the Probability Weighting Function (PWF) with: pwf = nullp(DEgenes, bias.data = my_length_vector)

The GO terms I downloaded for wheat from BioMart are in the two-column data frame format, so that's what I tried first, with the code:
GO.wall = goseq(pwf, gene2cat = wheat_GO_terms)
but I get three errors:
Error: node stack overflow.
Error during wrapup: node stack overflow.
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

Does anyone know how to overcome these errors in GOseq?
I manually created a very short list of lists to see if that works, and it does, but I am struggling to create the list of lists from the two-column data frame with repeating row values. The GOseq manual indicates the data frame approach should work.

I'd love to hear from you if you've had success with the data frame input format. Or, if you can help with converting a data frame that has repeating gene names in the first column and GO terms in the second column into a list of lists, where each list is named by a gene and holds that gene's GO terms, that would be fantastic. Thank you!!

RNA-Seq software error R • 2.0k views


score 0 · Answer 1 · 2020-11-23


4.6 years ago

et395 • 0

I tried to solve my problem with the function split(). For example, if df is the two-column data frame with repeating gene IDs and unique GO terms: split(df$gene_id, df$GO)

This produces a list of lists where the list names are GO terms and the lists hold gene_ids. It does work for gene2cat, but I'm not sure it is correct, since the manual calls for a list of lists where the names are gene_ids and the lists hold GO terms. I tried the reverse, split(df$GO, df$GO), but the output is strange: it just becomes a list of all the gene IDs with no GO terms as entries.
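To make the two orientations concrete, here is a small sketch on a toy data frame. The gene IDs and GO terms are invented, and the column names gene_id and GO are assumed to match the ones used above. In split(x, f), the first argument supplies the values and the second supplies the grouping, so the second argument decides what the list names are.

```r
# Toy two-column table: one row per gene/GO-term pair (invented values).
df <- data.frame(
  gene_id = c("geneA", "geneA", "geneB", "geneC", "geneC"),
  GO      = c("GO:0000001", "GO:0000002", "GO:0000001",
              "GO:0000002", "GO:0000003"),
  stringsAsFactors = FALSE
)

# Names are gene IDs, entries are GO terms.
gene_to_go <- split(df$GO, df$gene_id)

# Names are GO terms, entries are gene IDs.
go_to_gene <- split(df$gene_id, df$GO)

str(gene_to_go)
str(go_to_gene)
```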



"I tried the reverse, split(df$GO, df$GO)"

That is just a typo, right? You actually tried split(df$GO, df$gene_id)?

Reply • 4.6 years ago by e.rempel ★ 1.1k


Hi yes, my mistake, that is a typo! Thanks for your comment.

My gene names are formatted based on the reference genome annotation, for example "TraesCS4A02G403700". The .txt file for the df has gene names and GO terms and came directly from the Ensembl BioMart download. I read in the .txt file with read_delim( file path, delim = ",") and get two columns of character variables.

split(df$GO, df$gene_id) produces a list of lists whose length equals the number of unique gene_ids, but the list names have a different gene name format, for example "ENSRNA050007810", and each list is just "NA".

When I run the inverse of what GOseq wants, split(df$gene_id, df$GO), I get a nice list of lists whose length equals the number of unique GO terms. Each list is named by a GO term and filled with the associated gene_ids in the expected format, "TraesCS4A02G403700".
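One way to check where the "NA" entries and the unexpected IDs come from is to inspect the table before splitting it. The sketch below uses a toy table: the two IDs quoted above plus one invented ID ("TraesCS1A02G000100"), with GO terms used as placeholders only. It assumes the same gene_id and GO column names and that a missing annotation shows up as NA or as an empty string.

```r
# Toy table imitating the download: some rows have no GO term.
# "TraesCS1A02G000100" and the GO assignments are invented.
df <- data.frame(
  gene_id = c("TraesCS4A02G403700", "TraesCS4A02G403700",
              "ENSRNA050007810", "TraesCS1A02G000100"),
  GO      = c("GO:0005515", "GO:0016020", NA, ""),
  stringsAsFactors = FALSE
)

# Rows without a usable GO term (NA or empty string).
missing_go <- is.na(df$GO) | df$GO == ""
table(missing_go)

# IDs that do not follow the "TraesCS..." annotation format.
odd_ids <- unique(df$gene_id[!grepl("^TraesCS", df$gene_id)])
odd_ids

# Keep only annotated rows, then build the gene -> GO list.
annotated  <- df[!missing_go, ]
gene_to_go <- split(annotated$GO, annotated$gene_id)
```

If the table was read with readr, it is a tibble rather than a plain data frame. Converting it with as.data.frame() before passing it on is a cheap thing to rule out, although nothing in this thread confirms that this was the cause of the errors.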

I am pretty stumped - I've never come across something like this before. Thanks!

ADD REPLY • link 4.6 years ago by et395 • 0

score 0 · Answer 2 · 2020-11-24


4.6 years ago

et395 • 0

I think the BioMart download must have a bug. I ended up going into the raw .gff3 files and extracting the gene names and GO terms, and was able to successfully create the gene2cat object.
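The post does not say how the extraction was done or which GFF3 attribute held the GO terms, so the following is only a sketch under assumptions: GO terms are taken to sit in the GFF3 reserved attribute Ontology_term on the gene rows, and the gene name is taken from ID. The GFF3 lines are toy data, and get_attr is a hypothetical helper written for this example.

```r
# Toy GFF3 content (invented); a real file would be read from disk instead.
gff_lines <- c(
  "##gff-version 3",
  "chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=geneA;Ontology_term=GO:0000001,GO:0000002",
  "chr1\ttoy\tgene\t1500\t2400\t.\t-\t.\tID=geneB;Ontology_term=GO:0000001",
  "chr1\ttoy\tgene\t3000\t3800\t.\t+\t.\tID=geneC"
)

gff <- read.delim(
  text = gff_lines, header = FALSE, comment.char = "#",
  stringsAsFactors = FALSE,
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)
genes <- gff[gff$type == "gene", ]

# Hypothetical helper: value of one key in the attributes column,
# NA when the key is absent.
get_attr <- function(attributes, key) {
  pattern <- paste0("^(?:.*;)?", key, "=([^;]*).*$")
  ifelse(grepl(pattern, attributes, perl = TRUE),
         sub(pattern, "\\1", attributes, perl = TRUE),
         NA_character_)
}

gene_id  <- get_attr(genes$attributes, "ID")
go_field <- get_attr(genes$attributes, "Ontology_term")

# Split the comma-separated GO terms; drop genes with no annotation.
has_go     <- !is.na(go_field)
gene_to_go <- strsplit(go_field[has_go], ",", fixed = TRUE)
names(gene_to_go) <- gene_id[has_go]
```

The attribute names in a real annotation file may differ, so it is worth looking at a few attribute strings first and adjusting the keys.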
