Question: How to get GO terms for E. Coli genes in R?
0
gravatar for mamillerpa
6.0 years ago by
mamillerpa40
United States
mamillerpa40 wrote:

I just did a differential gene expression analysis with cufflinks and imported it into R with cummerbund.

My reference genome and GFF annotation were from Ensembl, and the Ensembl IDs for genes were carried through instead of the mnemonic gene symbols. (see below)

Is there a way to get GO terms for each of my E. coli genes within R/bioconductor?  Can I do it with gene identifiers like "gene:b3281", or do I need to translate those to some other identifiers like "aroE" ?

 

biolinux@biolinux-VirtualBox[biolinux] grep -C 1 aroE Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.22.gff3
Chromosome ensembl transcript 3429766 3430023 . - . ID=transcript:AAC76305;Parent=gene:b3280;biotype=protein_coding;external_name=yrdB-1;logic_name=ena
Chromosome ensembl gene 3430020 3430838 . - . ID=gene:b3281;biotype=protein_coding;description=dehydroshikimate reductase%2C NAD(P)-binding;external_name=aroE;logic_name=ena
Chromosome ensembl transcript 3430020 3430838 . - . ID=transcript:AAC76306;Parent=gene:b3281;biotype=protein_coding;external_name=aroE-1;logic_name=ena
Chromosome ensembl gene 3430843 3431415 . - . ID=gene:b3282;biotype=protein_coding;description=tRNA(ANN) t(6)A37 threonylcarbamoyladenosine modification protein%2C threonine-dependent ADP-forming ATPase;external_name=tsaC;logic_name=ena


biolinux@biolinux-VirtualBox[fromoffice] grep b3281 gene_exp.diff
gene:b3281 gene:b3281 - Chromosome:3429235-3430838 q1 q2 OK 24.697 33.6117 0.444627 0.672035 0.3566 0.504885 no

go rna-seq bacteria R • 2.9k views
ADD COMMENTlink modified 6.0 years ago by oganm60 • written 6.0 years ago by mamillerpa40

I have just parsed the "b numbers" and gene symbols out of the Ensmebl GFF file. Then I can get symbol:GO term mappings from topGO

Does this look even remotely reasonable?

EcExpGFF <- read.delim("Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.22.gff3", header=FALSE, stringsAsFactors=FALSE)

#I'm only interested in genes. Column 9 contains semicolon delimited key:value pairs, including the "b number" and the gene symbol "external gene id"
EcGFFgenes <- EcExpGFF[EcExpGFF$V3=="gene", 9]

#split the semicolon delimited list
EcLookupFromGFF <- read.csv(text = as.character(EcGFFgenes), header = FALSE, sep=";", stringsAsFactors = FALSE)

#remove the Key: portion of the text.

EcGenes<-substr(EcLookupFromGFF$V1,4,max(length(EcLookupFromGFF$V1)))
EcSyms <-substr(EcLookupFromGFF$V4,15,max(length(EcLookupFromGFF$V4)))

#now get the mapping between GO terms and the gene symbols. Use topGO library.

GO_BP2sym_list <- annFUN.org(whichOnto = "BP", EcSyms, mapping = "org.EcK12.eg.db", ID = "symbol")
ADD REPLYlink modified 13 months ago by RamRS27k • written 6.0 years ago by mamillerpa40
0
gravatar for oganm
6.0 years ago by
oganm60
Canada
oganm60 wrote:

You can use biomart package for this purpose. But it will not be an enrichment analysis and won't convey much useful information.

mart = useMart("ensembl", dataset="mmusculus_gene_ensembl") #replace mus musculus with your favorite species
goTerms = getBM(attributes = c("ensembl_gene_id", "goslim_goa_accession",
                               "goslim_goa_description"),
                filters = "mgi_symbol",
                values = geneIDs,
                mart = mart) #geneIDs is a vector containing your ensemble gene IDs
ADD COMMENTlink modified 13 months ago by RamRS27k • written 6.0 years ago by oganm60

Thanks, oganm.

> mart = useMart('ensembl') ; 
> listDatasets(mart)

doesn't show anything for E. coli. I was under the impression that Ensembl Bacterial didn't have a mart.

ADD REPLYlink modified 13 months ago by RamRS27k • written 6.0 years ago by mamillerpa40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1079 users visited in the last hour