Question

AnnotationForge and Error in FUN(X[[i]], ...) : data.frames in '...' cannot contain duplicated rows

2.2 years ago

najibveto ▴ 110

Hello, I am trying to build a Gene Ontology and KEGG annotation database for my non-model species. For that purpose I am using the AnnotationForge package with the following code:

library(tidyverse)        # attaches dplyr and stringr
library(clusterProfiler)
library(AnnotationHub)
library(AnnotationForge)

# Read the annotation table and treat empty strings as missing
egg <- rio::import('fatheadminnow-annotation.tsv')
egg[egg == ""] <- NA
colnames(egg)

# Gene table: one row per query with its seed ortholog
gene_info <- egg %>%
  dplyr::select(GID = query_name, GENENAME = seed_ortholog) %>%
  na.omit()

# GO table: drop entries containing "-", then split the comma-separated GO lists
gterms <- egg %>%
  dplyr::select(query_name, GOs) %>%
  na.omit()
gterms <- gterms[!grepl("-", gterms$GOs), ]
all_go_list <- str_split(gterms$GOs, ",")
gene2go <- data.frame(GID = rep(gterms$query_name,
                                times = sapply(all_go_list, length)),
                      GO = unlist(all_go_list),
                      EVIDENCE = "IEA")
gene2go <- gene2go[!grepl("-", gene2go$GO), ]

# KO table: strip the first "ko:" prefix and drop entries containing "-"
gene2ko <- egg %>%
  dplyr::select(GID = query_name, KO = KEGG_ko) %>%
  na.omit()
load("kegg_info.RData")   # provides ko2pathway
colnames(ko2pathway) <- c("KO", "Pathway")
gene2ko$KO <- str_replace(gene2ko$KO, "ko:", "")
gene2ko <- gene2ko[!grepl("-", gene2ko$KO), ]

# Pathway table: map each gene's KO to its pathways
gene2pathway <- gene2ko %>%
  left_join(ko2pathway, by = "KO") %>%
  dplyr::select(GID, Pathway) %>%
  na.omit()

# Remove duplicated rows from the GO and KO tables
gene2go <- dplyr::distinct(gene2go)
gene2ko <- dplyr::distinct(gene2ko)

makeOrgPackage(gene_info = gene_info,
               go = gene2go,
               ko = gene2ko,
               maintainer = 'gmail.com>',
               author = 'gmail.com>',
               pathway = gene2pathway,
               version = "0.0.1",
               outputDir = "C:/Users/Documents",
               tax_id = 90988,
               genus = "pimephales",
               species = "promelas",
               goTable = "go")

The gene2go table looks as follows:

[screenshot of the gene2go table; image not preserved]

The gene2ko table looks as follows:

[screenshot of the gene2ko table; image not preserved]

When I run it, I get this error:

Error in FUN(X[[i]], ...) : 
  data.frames in '...' cannot contain duplicated rows

I have already used AnnotationForge to build a database for another species, Aspergillus niger, and it worked fine. What could the problem be, and how can I solve it? Thank you for your help.

AnnotationForge • 1.2k views

ADD COMMENT • link 2.2 years ago by najibveto ▴ 110

As the error message you quoted says, "data.frames in '...' cannot contain duplicated rows".

Your data frame gene2go contains duplicated entries. Make a file in which each line has one gene followed by all of its GO terms:

cep41   GO:0000086,GO:0000226,GO:0000278...
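
To see which table actually triggers the message, it may help to count fully duplicated rows in every data frame passed to makeOrgPackage. The following is only a minimal base-R sketch: count_duplicated_rows is a hypothetical helper, and the toy tables use made-up values rather than data from this thread. Note that a gene appearing on several rows with different GO terms is not a duplicated row; only rows identical in every column count.

```r
# Hypothetical helper: number of fully duplicated rows in each table
count_duplicated_rows <- function(tables) {
  vapply(tables, function(df) sum(duplicated(df)), integer(1))
}

# Toy stand-ins for the real tables (made-up values)
gene_info <- data.frame(GID = c("g1", "g2", "g2"),
                        GENENAME = c("a", "b", "b"))
gene2go <- data.frame(GID = c("g1", "g1", "g2"),
                      GO = c("GO:0000086", "GO:0000226", "GO:0000086"),
                      EVIDENCE = "IEA")

# gene_info repeats the row (g2, b); gene2go repeats g1 but with different GO terms
dup_counts <- count_duplicated_rows(list(gene_info = gene_info,
                                         gene2go = gene2go))
print(dup_counts)

# Dropping exact duplicates with base R
gene_info_unique <- unique(gene_info)
```

With the real data, the same call could be made on list(gene_info = gene_info, go = gene2go, ko = gene2ko, pathway = gene2pathway); any table with a non-zero count is a candidate for unique() or dplyr::distinct().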

ADD REPLY • link 2.2 years ago by kashiff007 ★ 1.9k

Thank you for your answer. I have already used the same package with a different species:

[screenshot of the table for the other species; image not preserved]

As you can see, the same transcript appears on several rows here as well, yet when I ran the same code I could build the database:

# Read the annotation table and treat empty strings as missing
egg <- rio::import('KCN5-annotation.tsv')
egg[egg == ""] <- NA
colnames(egg)

# Gene table: one row per query with its seed ortholog
gene_info <- egg %>%
  dplyr::select(GID = query_name, GENENAME = seed_ortholog) %>%
  na.omit()

# GO table: split the comma-separated GO lists, then drop entries containing "-"
gterms <- egg %>%
  dplyr::select(query_name, GOs) %>%
  na.omit()
library(stringr)
all_go_list <- str_split(gterms$GOs, ",")
gene2go <- data.frame(GID = rep(gterms$query_name,
                                times = sapply(all_go_list, length)),
                      GO = unlist(all_go_list),
                      EVIDENCE = "IEA")
gene2go <- gene2go[!grepl("-", gene2go$GO), ]

# KO table: strip the first "ko:" prefix
gene2ko <- egg %>%
  dplyr::select(GID = query_name, KO = KEGG_ko) %>%
  na.omit()
load("kegg_info.RData")   # provides ko2pathway
colnames(ko2pathway) <- c("KO", "Pathway")
gene2ko$KO <- str_replace(gene2ko$KO, "ko:", "")

# Pathway table: map each gene's KO to its pathways
gene2pathway <- gene2ko %>%
  left_join(ko2pathway, by = "KO") %>%
  dplyr::select(GID, Pathway) %>%
  na.omit()

makeOrgPackage(gene_info = gene_info,
               go = gene2go,
               ko = gene2ko,
               maintainer = '@gmail.com>',
               author = '@gmail.com>',
               pathway = gene2pathway,
               version = "0.0.1",
               outputDir = "C:/Project05",
               tax_id = 5061,
               genus = "Aspergillus",
               species = "nigerKCN5",
               goTable = "go")

and the database was built:

Populating genes table:
genes table filled
Populating gene_info table:
gene_info table filled
Populating go table:
go table filled
Populating ko table:
ko table filled
Populating pathway table:
pathway table filled
table metadata filled
'select()' returned many:1 mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1 mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled

So it is puzzling that this works for one species and not for the other.

ADD REPLY • link 2.2 years ago by najibveto ▴ 110
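
One possibility that the thread does not confirm: the fathead minnow code deduplicates only gene2go and gene2ko, so gene_info and gene2pathway could still contain identical rows. The sketch below, on made-up toy data, shows how a join followed by dropping the KO column can create duplicated (GID, Pathway) rows even when gene2ko itself has none. Base-R merge() is used here as a stand-in for the left_join call in the posted code, and all values are hypothetical.

```r
# Toy data (made up): one gene with two KOs that map to the same pathway
gene2ko <- data.frame(GID = c("g1", "g1"),
                      KO = c("K00001", "K00002"))
ko2pathway <- data.frame(KO = c("K00001", "K00002"),
                         Pathway = c("ko00010", "ko00010"))

# Base-R stand-in for left_join(gene2ko, ko2pathway, by = "KO")
joined <- merge(gene2ko, ko2pathway, by = "KO", all.x = TRUE)

# Dropping the KO column leaves two identical (GID, Pathway) rows
gene2pathway <- joined[, c("GID", "Pathway")]

n_dup_ko <- sum(duplicated(gene2ko))            # duplicates before the join
n_dup_pathway <- sum(duplicated(gene2pathway))  # duplicates after the join

# Removing the exact duplicates
gene2pathway_unique <- unique(gene2pathway)
```

If the real gene2pathway or gene_info shows duplicated rows in this way, applying dplyr::distinct() to them as well, as the posted code already does for gene2go and gene2ko, would be a reasonable thing to try.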