GOSeq (very few enriched genes with p-value less than 0.05)
0
0
Entering edit mode
9.9 years ago
yfyian ▴ 40

I am currently using your GOSeq package to perform GO enrichment for my analysis. Since I am working on lettuce, I have to generate my own transcript length file. I was able to perform my analysis all the way through GO enrichment step. Unfortunately, based on my analysis, there is no enriched gene. The p-values are bigger than 0.05 for all genes. I am wondering why I obtained such result? Also, in one of the many analyses I have performed, there are very few genes (less than 5 genes) with p-values smaller than 0.05. Why are there so few or no over/under represented GOs? I attached my R command lines for reference.

Many thanks, Yoong

##################
## GOSEQ script ##
##################
library(goseq)

# Import data in the format: geneID, DE

data = read.table("data.txt", header = T, sep = "\t")

## obtain DE values from log2fold change:

  # add a DE column to the count table:
  data$DE = c(1:nrow(data)) # will later be filled with 0 or 1

  # get DE values: 1 if DE, 0 if not:
  k = 0 #counter for removed, non-DE genes
  threshold = 5 #set the threshhold for log2fold change (accounts for + and -)


  for (i in 1:nrow(data)) {

      if (data[i,]$log2FoldChange >= threshold | data[i,]$log2FoldChange <= - threshold) {
        print(data[i,]$log2FoldChange)
        data[i,]$DE = 1
      }
      else {
        data[i,]$DE = 0
        k = k+1
      } 
    i = i+1
  }
  cat("including",i-k," DE terms \n")

View(data)

## Obtain the weighting for each gene, based on the gene length

data_pwf = unique(data[,c(-3,-4,-5)])
# this step ensures we only keep the unique GENE ID's.

# method 1:
DE = data_pwf$DE
genes = setNames(DE, data_pwf$Gene_ID)
transcript_length = data_pwf$Transcript_length

# method 2:
DE2 = data$DE
genes2 = setNames(DE2, data$Gene_ID)
transcript_length2 = data$Transcript_length
random_length = c(1:nrow(data))

# calculate Probability Weighting Function

# method 1:
pwf = nullp(genes, bias.data = transcript_length)

# method 2:
pwf2 = nullp(DE2, bias.data = transcript_length2)
pwf2$Gene_ID = data$Gene_ID
pwf2 = unique(pwf2)
row.names(pwf2) = pwf2$Gene_ID
pwf2 = pwf2[,-4]

##GOSEQ Analysis

#create gene2catfile that contains the mapping between genes and categories:
gene2cat = data[,c(1,3)]
gene2cat = dataGO
#run goseq()
GO.wall2 = goseq(pwf2,gene2cat=gene2cat, method = "Wallenius")

#select for enriched terms with p.adjusted < 0.05:
print(p.adjust(GO.wall2$over_represented_pvalue))

enriched.GO = GO.wall2$category[p.adjust(GO.wall2$over_represented_pvalue, method = "BH")<1]

#visualize the results:
for (go in enriched.GO[1:10]) {
  print(GOTERM[[go]])
}
p-values enriched-genes GOSeq • 6.8k views
ADD COMMENT
2
Entering edit mode

You're only describing things as significantly different if the fold-change between your groups is at least 32x. That seems CRAZY high. Unless you don't have replicates, perform DE analysis with one of the standard tools and then use the p-values it produces to determine if things are DE or not.

BTW, the first dozen or so of your lines could be replaced by:

data y- read.table("data.txt", header = T, sep = "\t")
data$DE <- 0
data$DE[which(abs(data$log2FoldChange) >= 5)] <- 1

While I really recommend not doing things this way, you'll at least find that this is a bit faster than what had written.

ADD REPLY
0
Entering edit mode

Thank you for your quick reply, Devon!

I have 3 biological reps for my samples. I changed from using log2fold changes to p-values to determine which genes are DE. I changed the p-value threshold as well. I was able to have some genes to be enriched by GO. That leads me to another question. What if I am only interested in enriching genes that are found to be differentially expressed by standard tools (DESEq2, EdgeR, etc)? In such case, I will be left with genes that have less than 0.05 for p-adjusted values. I won't be able to filter on GOSeq which genes are DE based on p-adjusted values since everything will be less than 0.05. Is it possible to filter them based on what's being upregulated versus downregulated instead of non-DE-genes versus DE-genes?

In GOSeq, why is important to include non-DE genes in the enrichment analysis when only the DE genes will be enriched?

ADD REPLY
1
Entering edit mode

You seem to be rather confused about what GOSeq does (and GO enrichment in general). GOSeq performs GO enrichment, it has nothing to doesn't enrich genes. GO enrichment is done by looking at a set of genes labeled DE and seeing if there are more of them in a given GO category than one would expect, given the background non DE genes (without them, you'd have no way of knowing how likely/unlikely the observation is).

GOseq is intended to be used downstream of DESeq2/edgeR/etc., so it's expecting you to input lists according to the output of those tools (BTW, an adjusted p-value threshold of 0.05 is likely rather conservative for looking at GO enrichment). You can't describe something as being up/downregulated if it's not differentially expressed (aka, differentially regulated), by definition.

ADD REPLY
0
Entering edit mode

Thank you for the explanation! Things are a lot clearer to me now. I was confused about the function of GOSeq/GO enrichment in general.

ADD REPLY

Login before adding your answer.

Traffic: 2601 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6