Hello, I am working in a lab and am trying to negotiate R studio. Currently I am using a program called GOexpress. I have created an expression set. During this analysis it appears that many genes have a score of 0, I am wondering if there is a way to filter out such genes so that the GOterm score (which is the average of the scores of its genes) is not lowered by many genes with a score of 0. I am not very computer literate or very good with R studio so detailed instructions would be most appreciated! Thank you in advance
I have created an expression set using these commands:
Assay data
library("Biobase") exprs<-as.matrix(cts) class (exprs) [1] Matrix dim(exprs) [1]5101 34 colnames(exprs) head(exprs[,1:5]) minimalSet <- ExpressionSet(assayData=exprs)
Phenotypic Data
pData <- coldata dim(pData) rownames(pData) summary(pData) all(rownames(pData)==colnames(exprs)) [1] TRUE names(pData) [1] "condition" "timepoint" sapply(pData, class) condition timepoint "factor" "factor" pData[c(15, 20), c("condition", "timepoint")] condition timepoint LN_1.5h_2 LN 1.5h LP_24h_1 LP 24h
Meta Data
metadata <- data.frame(labelDescription= c("CM growth conditions", "growth time"), row.names=c("condition", "timepoint")) phenoData <- new("AnnotatedDataFrame", data=pData, varMetadata=metadata) phenoData An object of class 'AnnotatedDataFrame' rowNames: RM_0.5h_1 RM_0.5h_2 ... LN_49h_3 (34 total) varLabels: condition timepoint varMetadata: labelDescription head(pData(phenoData)) condition timepoint RM_0.5h_1 RM 0.5h RM_0.5h_2 RM 0.5h RM_0.5h_3 RM 0.5h LP_0.5h_2 LP 0.5h LP_0.5h_3 LP 0.5h LN_0.5h_2 LN 0.5h phenoData[c("RM_0.5h_1","LN_49h_3"),"condition"] An object of class 'AnnotatedDataFrame' rowNames: RM_0.5h_1 LN_49h_3 varLabels: condition varMetadata: labelDescription
Annotations
experimentData <- new("MIAME",name="Kian Draper",lab="Stephen Rader Lab",contact="draper@unbc.ca",title="C merolae growth in LN",other=list(notes="Created from text files"))
exampleSet <-ExpressionSet(assayData=exprs,phenoData=phenoData,experimentData=experimentData)
exampleSet is the expression set I have made I then performed the GOexpress analysis using these commands:
GOexpress
library(GOexpress)
Prepare group factor to analyze
is.factor(exampleSet$"condition") [1] TRUE exampleSet$condition <- factor(exampleSet$condition)
Custom Annotations
library(biomaRt) listMarts(host='http://oct2017-plants.ensembl.org')
Update ensembl for the specified species and mart
ensembl <- useMart(biomart = "plants_mart", dataset = "cmerolae_eg_gene",host = "plants.ensembl.org") Determine which attributes and filters you need filters = listFilters(ensembl) filters attributes = listAttributes(ensembl) attributes
Put into tables with names
AllGO = getBM(attributes = c("go_id", "name_1006", "namespace_1003"), mart=ensembl) GOgenes = AllGenes = getBM(attributes = c("ensembl_gene_id", "go_id"), mart=ensembl) AllGenes = getBM(attributes = c("ensembl_gene_id","external_gene_name", "description"), mart=ensembl)
change column names
colnames(GOgenes)[1] = 'gene_id' colnames(AllGenes)[1] = ‘gene_id'
Run random forest algorithm
set.seed(4543) exampleSet_results <- GO_analyse(eSet = exampleSet, f = “condition", GO_genes = GOgenes, all_GO = AllGO, all_genes = AllGenes)
Permutation for P value based on ontologies
exampleSet_results.pVal = pValue_GO(result=exampleSet_results, N=1000)
Filtering the results
BP.5 <- subset_scores(result = exampleSet_results.pVal,namespace = "biological_process",total = 5, p.val=0.05) MF.10 <- subset_scores(result = exampleSet_results.pVal,namespace = "molecular_function",total = 10, p.val=0.05) CC.15 <- subset_scores(result = exampleSet_results.pVal,namespace = "cellular_component", total = 15, p.val=0.05)
Determine the top ranking GO terms & make a heat map for top ranking GO term
head(BP.5$GO) heatmap_GO(go_id = "GO:0033014", result = BP.5, eSet=exampleSet, cexRow=0.4, cexCol=1, cex.main=1, main.Lsplit=30) heatmap_GO(go_id = "GO:0007018", result = BP.5, eSet=exampleSet, cexRow=1, cexCol=1, cex.main=1, main.Lsplit=20)
Genes associated with GO term of interest
table_genes(go_id = "GO:0033014", result = BP.5)[,c(1:3)] expression_plot(gene_id = "CME132C", result = BP.5, eSet=exampleSet, x_var = "timepoint", title.size=1.5,legend.title.size=10, legend.text.size=10, legend.key.size=15, ylim=c(0,10000))
At this point many of the genes from table_genes have a score of 0, which seems to be bringing down the average score which is the score for the GOterm. I understand that you can specify the minimum number of genes in a GOterm with "total= " but I am wondering if there is a way to remove all genes with a score of 0?