Question: How to do gene length bias corrected KEGG pathway analysis for non-native species in goseq?
1
gravatar for CandiceChuDVM
18 months ago by
CandiceChuDVM1.4k
United States/College Station/Texas A&M University
CandiceChuDVM1.4k wrote:

I am using goseq to correct for RNA-seq length bias. Since my dog genome (canFam3) is not supported in goseq database, I tried to make TXDb object and GOmap by myself. However, I am not sure if I am doing it correctly because:

  1. Here is my PWF plot
    pwf plot

  2. When I applied GO.MF=goseq(pwf,gene2cat=GOmap, test.cats=c("GO:MF")), I still got all the GO categories.

  3. The build-in KEGG pathway analysis doesn't work for non-native gene identifier.

So, I don't even sure if I can trust my results. If it's correct, it could potentially be a tutorial and benefit other users since the goseq documentation isn't very self-explanatory.

Does anyone have better solution about doing gene length bias corrected KEGG pathway analysis for non-native species in goseq? Any recommendation?

The following is my R script:

Get DESeq2 results

countdata=read.table("all.genes.txt", sep=" ", header=TRUE, row.names=1)
condition <- factor(c(rep("control",2), rep("rapid",3), rep("slow",3),
                  rep("control",2), rep("rapid",3), rep("slow",3),
                  rep("control",2), rep("rapid",3), rep("slow",3)))
timepoints <- factor(c(rep("t1",8), rep("t2",8), rep("t3",8)))
sampleTable <- data.frame(condition = as.factor(condition),
                      timepoints = as.factor(timepoints))
d.deseq<-DESeqDataSetFromMatrix(countData = countdata,
                            colData = sampleTable,
                            design = ~condition+timepoints+condition:timepoints)
ddsTC<-DESeq(d.deseq, test="LRT", reduced = ~ condition+timepoints)
ddsTC$condition <- relevel(ddsTC$condition, ref="control")
resTC<-results(ddsTC, alpha = 0.05)

Get GO terms from biomaRt

resTC$ensembl <- sapply( strsplit( rownames(resTC), split="\\+" ), "[", 1 )
ensembl = useMart( "ensembl", dataset = "cfamiliaris_gene_ensembl" )
genemap <- getBM( attributes = c("hgnc_symbol","go_id","ensembl_gene_id"),
              filters = "ensembl_gene_id",
              values = resTC$ensembl,
              mart = ensembl )
idx <- match(resTC$ensembl, genemap$ensembl_gene_id)
resTC$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
resTC$go_id<- genemap$go_id[ idx ]

head(resTC)

log2 fold change (MLE): conditionslow.timepointst3 
LRT p-value: '~ condition + timepoints + condition:timepoints' vs '~ condition + timepoints' 
DataFrame with 6 rows and 9 columns
                  baseMean log2FoldChange     lfcSE      stat     pvalue
                 <numeric>      <numeric> <numeric> <numeric>  <numeric>
ENSCAFG00000000001 634.0943577     -1.4429324 0.5365985 8.8758883 0.06427766
ENSCAFG00000000002   0.7144562      0.8500395 5.1053260 0.8445657 0.93237552
ENSCAFG00000000005  16.6377634      0.8308960 0.7964801 3.6893245 0.44968051

                    padj            ensembl hgnc_symbol       go_id
               <numeric>        <character> <character> <character>
0.3191870 ENSCAFG00000000001       ENPP1  GO:0016787
NA ENSCAFG00000000002                        
0.8662957 ENSCAFG00000000005      PARD6G  GO:0005515

Make TXDb object

cfamiliaris_gene_ensembl <- makeTxDbFromBiomart(dataset="cfamiliaris_gene_ensembl")
txsByGene=transcriptsBy(cfamiliaris_gene_ensembl,"gene")
lengthData=median(width(txsByGene))
select_genes <- as.vector(names(lengthData)%in%names(genes))
new_lengthData <- lengthData[select_genes]

Make PWF plot

pwf=nullp(genes,bias.data=new_lengthData)

Make GOmap

GOlist=resTC[,c(7,9)]
GOmap = as.data.frame(GOlist)
GO.wall=goseq(pwf,gene2cat=GOmap)

head(GO.wall)

 category over_represented_pvalue under_represented_pvalue numDEInCat
 105 GO:0003824            4.722997e-11                1.0000000         66
 609 GO:0016627            7.830729e-06                0.9999998          6
 607 GO:0016620            1.674500e-04                0.9999779          8

numInCat
 105      359
 609        8
 607       24

                                                                                           term
 105                                                                              catalytic activity
 609                                    oxidoreductase activity, acting on the CH-CH group of donors
 607 oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor

ontology
 105       MF
 609       MF
 607       MF
 637       MF
 547       CC
 213       CC

Using GO.MF=goseq(pwf,gene2cat=GOmap, test.cats=c("GO:MF")) will give me the same result as GO.wall.

Since the "limiting GO categories and other category based tests" are not working for non-native gene identifier, the KEGG pathway analysis doesn't work for me.

Does anyone have better solution about how to get KEGG pathway analysis for non-native species that take gene length bias into account?

Thanks a lot,
Candice

ADD COMMENTlink modified 18 months ago by jimmy_zeng90 • written 18 months ago by CandiceChuDVM1.4k
0
gravatar for jimmy_zeng
18 months ago by
jimmy_zeng90
University of Macau
jimmy_zeng90 wrote:

Since I didn't use the package “goseq ” ever, I can't answer the question about this package.

But for doing KEGG pathway analysis, the majority us will choose enrichment, there's nothing matter to gene length, just choose the a list of interesting genes . and then do hyperGtest for the gene set.

Hope this can help you .

ADD COMMENTlink written 18 months ago by jimmy_zeng90

Based on my understanding, there are several papers talking about the influence of gene length bias and how to adjust for it. It made me feel uncomfortable to ignore the gene length bias and just pass the list of DE genes to gene ontology.

For your references: Gene ontology analysis for RNA-seq: accounting for selection bias Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression

I hope that would help!

ADD REPLYlink written 18 months ago by CandiceChuDVM1.4k

wow. I didn't think the gene lenth bias before ever.

It seems not very popular, I didn't see any tutorial talk about it too.

For enrichment, just a gene list , is enough, I am pretty sure, it's just a statistics.

For choosing DE genes, that's a trick question, not only the gene length , but also the p-value , fold changes should be taken care of .

ADD REPLYlink modified 17 months ago • written 17 months ago by jimmy_zeng90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 923 users visited in the last hour