How To Assign A ChIP-chip/ChIP-seq Peak To A Target Gene?
4
8
10.9 years ago

Given a set of bound regions for a transcription factor identified by ChIP-chip or ChIP-seq, how do you find the regulated target gene?

AFAIK, the method of choice for answering this question seems to be the ad hoc approach of "find the nearest gene or TSS", as implemented in the ChIPpeakAnno Bioconductor package. But since cis-regulatory elements can skip over "bystander" genes and act on non-neighboring genes, clearly something more sophisticated is needed to generate better target gene assignments.
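To make the baseline concrete, here is a minimal Python sketch of nearest-TSS assignment on a single chromosome. It is not the ChIPpeakAnno implementation; the function name `nearest_tss`, the gene names and all coordinates are made up for illustration, and strand is ignored.

```python
from bisect import bisect_left

def nearest_tss(peak_mid, tss_list):
    """Return (gene, distance) for the TSS closest to a peak midpoint.

    tss_list is a list of (position, gene) tuples for one chromosome,
    sorted by position. Strand and gene bodies are ignored in this sketch.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_mid)
    # Only the TSSs immediately left and right of the peak can be nearest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_mid))
    return gene, abs(pos - peak_mid)

# Hypothetical toy data: three TSSs and two peaks given as (start, end).
tss = [(1000, "geneA"), (5000, "geneB"), (20000, "geneC")]
peaks = [(4200, 4600), (11000, 11400)]
assignments = [nearest_tss((start + end) // 2, tss) for start, end in peaks]
```

In this toy data the second peak goes to geneB purely because it is 6.2 kb away rather than 8.8 kb, which is exactly the kind of call that says nothing about whether geneB is a bystander.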

I've seen one paper that integrates (i) distance to genes with (ii) expression data from knockout studies and (iii) prior data to prioritize target genes for bound regions, but there is no code available on their webpage. This group does provide a web application to store and browse target gene assignments, but I was hoping to find code that does the assignment automatically. Additional papers outlining strategies that can solve this task would be welcome as well.

chip-seq chip-seq target papers • 9.1k views
10
10.9 years ago

I am the person guilty of the method described on the Furlong Lab web site. I could dig up the scripts for you, but I think you would be better off reimplementing it. The scripts are written in Perl, and it clearly shows that I was playing around trying to come up with a method that would work well rather than knowing up front how to go about it. What is needed is thus really a complete rewrite (perhaps in R, to make it more easily usable for the array community) and not just a code cleanup.

The idea behind the method is really quite simple. You calculate a separate score for each kind of evidence for each gene and multiply the scores together. The score for the ChIP data is calculated from the distance between the gene and the closest TF binding site identified, using a sigmoid (or something similar) to assign a perfect score of 1 to genes close to a binding site, gradually dropping off, and a score of 0 to genes beyond some distance. For the expression data, q-values were similarly converted to scores between 0 and 1 (I think the formula was score=1-4*qvalue).
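A rough Python sketch of the scoring idea described above, not a port of the original Perl scripts. The logistic form, the `midpoint`, `steepness` and `max_distance` values, and all function names are assumptions chosen for illustration; only the 1-4*qvalue conversion comes from the description, and even that is quoted from memory.

```python
import math

def distance_score(distance_bp, midpoint=5000.0, steepness=0.001,
                   max_distance=20000.0):
    """Score close to 1 near a binding site, falling to 0 far away.

    A logistic curve stands in for "a sigmoid (or something similar)";
    all three parameters are assumed values, not the published ones.
    """
    if distance_bp > max_distance:
        return 0.0  # hard cutoff: no score beyond this distance
    return 1.0 / (1.0 + math.exp(steepness * (distance_bp - midpoint)))

def expression_score(qvalue):
    """Convert a q-value to a score in [0, 1] using 1 - 4*q, clipped."""
    return max(0.0, min(1.0, 1.0 - 4.0 * qvalue))

def combined_score(distance_bp, qvalue):
    """Multiply the per-evidence scores, as described in the post."""
    return distance_score(distance_bp) * expression_score(qvalue)

# Hypothetical genes: (distance to closest binding site in bp, q-value).
genes = {"geneA": (500, 0.01), "geneB": (5000, 0.10), "geneC": (800, 0.40)}
ranked = sorted(genes, key=lambda g: combined_score(*genes[g]), reverse=True)
```

Because the scores are multiplied, a gene needs support from both kinds of evidence: geneC sits right next to a binding site but ends up with a score of 0 because its q-value is above 0.25.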

EDIT:

The original Perl scripts can be found here:

They are a bit too large and ugly to put as code blocks, so I deposited them on Box.net instead.

1

Thanks Lars. Posting the code would be useful for us, and perhaps others as well, so if you can dig out a copy that would be much appreciated.

0

Scripts added as requested. The software is provided "as is" without warranty of any kind, express or implied, including the warranties of merchantability, fitness for a particular purpose, noninfringement and sanity after trying to understand it ;-)

0

This is great! Everything is more or less clear, and you are right - a rewrite in R would make a great project.

0

Am I right in saying that this method doesn't define CRMs for groups of peaks? And if one only has ChIP-seq data and no expression values, would one then essentially just be associating peaks with the nearest TSS based on distance?

0

Yes, that is correct - the whole point of the method was to combine ChIP and expression data.

6
9.9 years ago

Review paper on CRMs, not target gene assignment (thx Casey):

and recently published:

http://bioinformatics.oxfordjournals.org/content/27/23/3221.abstract

1

The first paper is about CRM prediction evaluation, not target gene assignment, but the second one looks very relevant. Many thanks!

5
10.9 years ago

Here is a similar method recently published in NAR: they use cross-species synteny, GO similarity between the TF and the flanking gene, and the distance between the TF and the flanking gene in a protein-protein interaction network. They then train a random-forest classifier (whatever that is; it looks like a decision tree) on a manual test set using these attributes and say that it performs better than the closest-gene approach.

0

I don't see a link to the code in the paper. Is it available somewhere, or on request from the authors?

2
10.6 years ago
Patrick ▴ 20

Two options you may want to consider are association via eQTL studies and GRAIL analysis. There are a few eQTL (expression quantitative trait locus) datasets available, which link SNPs to changes in gene expression. I'm sure you could match on genomic position or find SNPs covered by your peaks. Of course, cis-regulation is cell-type and stimulus specific, so you may be in trouble if the eQTL data come from the wrong cell type. GRAIL (Gene Relationships Across Implicated Loci) is another good alternative if you want to test for commonality between implicated, close-by genes.
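As a sketch of the "find SNPs covered by your peaks" idea, the following Python intersects peaks with eQTL SNPs and reports the gene each SNP is associated with. The tuple layouts, the function name `eqtl_targets` and the toy records are assumptions for illustration; real eQTL datasets come in their own formats and would need parsing first.

```python
def eqtl_targets(peaks, eqtls):
    """Map each peak to genes whose eQTL SNPs fall inside the peak.

    peaks: list of (chrom, start, end), inclusive coordinates.
    eqtls: list of (chrom, snp_position, associated_gene).
    A brute-force scan is fine for a sketch; use an interval index
    for genome-scale data.
    """
    targets = {}
    for chrom, start, end in peaks:
        genes = sorted({gene for s_chrom, pos, gene in eqtls
                        if s_chrom == chrom and start <= pos <= end})
        targets[(chrom, start, end)] = genes
    return targets

# Hypothetical toy records.
peaks = [("chr1", 1000, 1500), ("chr2", 300, 700)]
eqtls = [("chr1", 1200, "geneX"), ("chr1", 9000, "geneY"),
         ("chr2", 1200, "geneZ")]
targets = eqtl_targets(peaks, eqtls)
```

A peak with an empty list here simply has no eQTL support in that dataset, which, given the cell-type caveat above, is not evidence that it has no target.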

Now, in my opinion, coming from a molecular biology upbringing, associations based on eQTL studies and proximity still require experimental validation before you can be sure that any particular transcription factor binding site is functional. Consider knocking down/out the transcription factor, or overexpressing it, and assaying for changes in expression.

0

Thanks for the suggestions. I agree that GWAS/eQTL+ChIP-seq may be a powerful combination of approaches in the future, though I'm afraid that the GRAIL approach may not give direct enough links between TFs and their targets.