What is the best way to do pathway analysis computationally for a set of genes or proteins of interest? Specifically, I am trying to identify common functions or pathways in a set of genes mutated in cancer samples. I know I could look at GO terms and use tools like DAVID. Does anyone have other really good techniques for this?
ConsensusPathDB (CPDB) is a meta-search engine for pathway analysis. It incorporates most of the reputable public-access pathway databases out there.
One major source outside of CPDB is Ingenuity IPA. This is proprietary software and, in addition to public-access database information, has a manually curated database of millions of pathway "associations" mined from academic papers.
Between these two, I think you can capture most compiled pathway information.
There are a lot of posts here and elsewhere about pathway analysis. How you go about it depends on what data you have and what you want to see. This post and the review it refers to are good places to start: http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
To begin with, there is no single best method. It always depends on the data you have in hand.
"Gene Ontology enrichment analysis != Pathway analysis"
For a detailed explanation of GO term enrichment see this previous discussion at Biostars.
You mentioned that
I am trying to identify common functions or pathways in a set of genes mutated in cancer samples.
I assume your data could have come from a genome/exome/transcriptome analysis workflow. If your list of genes is from an exome or genome workflow, the approach discussed in the previous answers will be enough, but you need to take care of a few important things.
To do a pathway analysis you primarily need
- List of background genes
- List of perturbed genes
- Annotation file that maps each gene to a pathway
Now you have to be very careful when you define your background. If your data is from a tumor-normal pair, your background should only contain the genes that are specific to the cell line or tissue of interest. Consult databases like HPRD or the Human Protein Atlas to find cell- or tissue-specific genes. Once you have these files, you can perform the enrichment analysis in R (a standard statistical test such as the hypergeometric or Fisher's exact test, followed by multiple testing correction) to find significant pathways. Use external tools only if they allow you to input a user-defined or experimental-platform-specific background.
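To make the enrichment step concrete, here is a minimal sketch in plain Python (standard library only) rather than R. It assumes a one-sided hypergeometric test for over-representation and Benjamini-Hochberg correction, which is one common choice and not the only valid one. The function names (`hypergeom_sf`, `benjamini_hochberg`, `enrich`) and the toy gene and pathway names are made up for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N background genes, K of which are in the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(perturbed, background, pathways):
    """Over-representation test of each pathway against a user-defined background.

    perturbed  : iterable of gene IDs of interest (e.g. mutated genes)
    background : iterable of gene IDs that could have been detected
    pathways   : dict mapping pathway name -> iterable of gene IDs
    Returns rows (name, overlap, p_value, adjusted_p) sorted by p-value.
    """
    background = set(background)
    hits = set(perturbed) & background  # genes outside the background are ignored
    N, n = len(background), len(hits)
    names, overlaps, pvals = [], [], []
    for name, genes in pathways.items():
        members = set(genes) & background  # restrict the pathway to the background
        if not members:
            continue
        k = len(hits & members)
        names.append(name)
        overlaps.append(k)
        pvals.append(hypergeom_sf(k, N, len(members), n))
    padj = benjamini_hochberg(pvals)
    return sorted(zip(names, overlaps, pvals, padj), key=lambda row: row[2])


# Toy data, purely illustrative.
toy_background = [f"g{i}" for i in range(20)]
toy_pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "pathway_B": ["g10", "g11", "g12", "g13", "g14"],
}
toy_perturbed = ["g0", "g1", "g2", "g3", "g10"]

for row in enrich(toy_perturbed, toy_background, toy_pathways):
    print(row)
```

Note that both the perturbed list and every pathway are intersected with the background before counting, so changing the background changes the p-values. That is exactly why the choice of background matters so much.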
If your data is from a transcriptome/RNA-Seq experiment, you may use goseq. It uses a statistical approach developed specifically for RNA-Seq data that accounts for gene length or total read count bias in gene set tests.
You may also refer to a previous post here
There are many, many potential methods here:
Getting GO terms is a good start, but even here the level of curation is mixed.
Always treat pathway analyses with caution, and have a plan for how to biologically validate your results if you intend to publish. Most publicly available analysis algorithms work from publicly available data, and these data are incomplete for most genes of interest. This is true for online web tools such as STRING and GeneMANIA, but if you filter with the most stringent search criteria, interesting connections can be found. Also take a look at the NCI Pathway Interaction Database.
Do you have questions about how to approach specific hypotheses through pathway analysis?
You can use my package ReactomePA for Reactome pathway analysis: http://www.bioconductor.org/packages/2.11/bioc/html/ReactomePA.html