How To Assign A ChIP-chip/ChIP-seq Peak To A Target Gene?
4
8
10.9 years ago

Given a set of bound regions for a transcription factor identified by ChIP-chip or ChIP-seq, how do you find the regulated target gene?

AFAIK, the method of choice for answering this question seems to be the ad hoc approach of "find the nearest gene or TSS", as implemented in the ChIPpeakAnno Bioconductor package. But since cis-regulatory elements can skip over "bystander" genes and act on non-neighboring genes, something more sophisticated is clearly needed to generate better target gene assignments.
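For reference, the nearest-TSS baseline can be sketched in a few lines of Python. This is only an illustration of the idea, not the ChIPpeakAnno implementation: the function name `nearest_tss`, the tuple layouts and the use of the peak midpoint are all assumptions made for the example.

```python
from bisect import bisect_left

def nearest_tss(peaks, tss_by_chrom):
    """Assign each peak to the gene whose TSS is closest to the peak midpoint.

    peaks:        list of (chrom, start, end) tuples (hypothetical layout)
    tss_by_chrom: dict mapping chrom -> list of (tss_position, gene_name),
                  sorted by position (hypothetical layout)
    Returns one (gene_name, distance_bp) pair per peak, or (None, None)
    when the chromosome has no annotated TSS.
    """
    assignments = []
    for chrom, start, end in peaks:
        tss_list = tss_by_chrom.get(chrom, [])
        if not tss_list:
            assignments.append((None, None))
            continue
        mid = (start + end) // 2
        positions = [pos for pos, _ in tss_list]
        i = bisect_left(positions, mid)
        # Only the TSSs directly left and right of the midpoint can be nearest.
        candidates = tss_list[max(i - 1, 0):i + 1]
        pos, gene = min(candidates, key=lambda t: abs(t[0] - mid))
        assignments.append((gene, abs(pos - mid)))
    return assignments

# Toy data, made up for the example.
tss = {"chr1": [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]}
peaks = [("chr1", 1100, 1300), ("chr1", 6900, 7300), ("chr2", 10, 20)]
print(nearest_tss(peaks, tss))
```

Note that the second toy peak is assigned to geneC purely because it is 200 bp closer than geneB, which is exactly the kind of call that ignores "bystander" genes.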

I've seen one paper that integrates (i) distance to genes with (ii) expression data from knockout studies and (iii) prior data to prioritize target genes for bound regions, but there is no code available on their webpage. This group does provide a web application to store and browse target gene assignments, but I was hoping to find additional code that does this automatically. Additional papers outlining strategies that can solve this task would be welcome as well.

chip-seq chip-seq target papers • 9.1k views
ADD COMMENT
10
10.9 years ago

I am the person guilty of the method described on the Furlong Lab web site. I could dig up the scripts for you, but I think you would be better off reimplementing it. The scripts are written in Perl, and they clearly show that I was playing around, trying to come up with a method that would work well, rather than knowing up front how to go about it. What is needed is thus really a complete rewrite (perhaps in R to make it more easily usable for the array community) and not just a code cleanup.

The idea behind the method is really quite simple. You calculate separate scores for each kind of evidence for each gene and multiply them together. The score for the ChIP data is calculated from the distance between the gene and the closest TF binding site identified, using a sigmoid (or something similar) to assign a perfect score of 1 for genes close to a binding site, gradually dropping off, and a score of 0 to genes beyond some distance. For the expression data, q-values were similarly converted to scores between 0 and 1 (I think the formula was score=1-4*qvalue).
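The scoring idea described above can be sketched roughly as follows. This is not the original Perl code: the sigmoid parameters (`midpoint_bp`, `steepness_bp`, `max_distance_bp`) are made-up placeholders, and clipping the expression score to the range 0 to 1 is an assumption, since 1-4*qvalue goes negative for q-values above 0.25.

```python
import math

def distance_score(distance_bp, midpoint_bp=5000.0, steepness_bp=1000.0,
                   max_distance_bp=20000.0):
    """Sigmoid score for the ChIP evidence: close to 1 near a binding site,
    dropping off gradually, and exactly 0 beyond max_distance_bp.
    All three parameters are hypothetical values chosen for illustration."""
    if distance_bp >= max_distance_bp:
        return 0.0
    return 1.0 / (1.0 + math.exp((distance_bp - midpoint_bp) / steepness_bp))

def expression_score(qvalue):
    """Expression evidence as score = 1 - 4*qvalue, clipped to [0, 1].
    The clipping is an assumption made for this sketch."""
    return min(1.0, max(0.0, 1.0 - 4.0 * qvalue))

def combined_score(distance_bp, qvalue):
    """Multiply the per-evidence scores to get one score per gene."""
    return distance_score(distance_bp) * expression_score(qvalue)

# Toy genes: (name, distance to closest binding site in bp, q-value).
genes = [("geneA", 500, 0.01), ("geneB", 500, 0.30), ("geneC", 30000, 0.001)]
for name, dist, q in genes:
    print(name, round(combined_score(dist, q), 3))
```

Because the scores are multiplied, a gene needs support from both kinds of evidence: a gene next to a binding site with no expression change, or a strongly changed gene far from any site, ends up with a score of 0.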

EDIT:

The original Perl scripts can be found here:

They are a bit too large and ugly to put as code blocks, so I deposited them on Box.net instead.

ADD COMMENT
1

Thanks Lars. Posting the code would be useful for us, and perhaps others as well, so if you can dig out a copy that would be much appreciated.

ADD REPLY
0

Scripts added as requested. The software is provided "as is" without warranty of any kind, express or implied, including the warranties of merchantability, fitness for a particular purpose, noninfringement and sanity after trying to understand it ;-)

ADD REPLY
0

This is great! Everything is more or less clear, and you are right - a rewrite in R would make a great project.

ADD REPLY
0

Am I right in saying that this method doesn't define CRMs for groups of peaks? And if one only has ChIP-seq data and no expression values, will one then essentially just be associating peaks with the nearest TSS based on distance?

ADD REPLY
0

Yes, that is correct - the whole point of the method was to combine ChIP and expression data.

ADD REPLY
6
9.9 years ago

Review paper on CRMs, not target gene assignment (thx Casey):

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001020

and recently published:

http://bioinformatics.oxfordjournals.org/content/27/23/3221.abstract

ADD COMMENT
1

The first paper is about CRM prediction evaluation, not target gene assignment, but the second one looks very relevant. Many thanks!

ADD REPLY
5
10.9 years ago

Here is a similar method recently published in NAR: they use cross-species synteny, GO similarity between the TF and the flanking gene, and the distance between the TF and the flanking gene in a protein-protein interaction network. They then train a random-forest classifier (whatever that is, it looks like a decision tree) from this data on a manual test set using these attributes and say that it performs better than the closest-gene approach.

ADD COMMENT
0

I don't see a link to the code in the paper. Is this available somewhere or upon request to the authors?

ADD REPLY
2
10.6 years ago
Patrick ▴ 20

Two options you may want to consider are association by eQTL studies and GRAIL analysis. There are a few eQTL (expression quantitative trait locus) datasets available, which link SNPs to changes in gene expression. I'm sure you could use genomic position or find SNPs covered by your peaks. Of course, cis-regulation is cell-type and stimulus-type specific, so you may be in trouble if you have the wrong cell type. GRAIL (Gene Relationships Across Implicated Loci) is another good alternative if you want to assay for commonality between implicated, close-by genes.
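A minimal sketch of the "find SNPs covered by your peaks" idea, assuming the eQTL data can be reduced to (chrom, SNP position, associated gene) records. That record layout, the half-open peak coordinates and the function name `genes_linked_by_eqtl` are all assumptions for illustration, not the format of any particular eQTL dataset.

```python
from collections import defaultdict

def genes_linked_by_eqtl(peaks, eqtls):
    """Link each peak to the genes associated with eQTL SNPs inside it.

    peaks: list of (chrom, start, end), treated as half-open [start, end)
    eqtls: list of (chrom, snp_position, gene) records (hypothetical layout)
    Returns a dict mapping each peak tuple to a sorted list of gene names.
    """
    snps_by_chrom = defaultdict(list)
    for chrom, pos, gene in eqtls:
        snps_by_chrom[chrom].append((pos, gene))

    links = {}
    for peak in peaks:
        chrom, start, end = peak
        genes = {gene for pos, gene in snps_by_chrom.get(chrom, [])
                 if start <= pos < end}
        links[peak] = sorted(genes)
    return links

# Toy data, made up for the example.
peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
eqtls = [("chr1", 150, "geneX"), ("chr1", 199, "geneY"),
         ("chr1", 200, "geneZ"), ("chr2", 150, "geneW")]
print(genes_linked_by_eqtl(peaks, eqtls))
```

The linear scan per peak is fine for small inputs; for genome-wide data an interval index or sorted positions would be the obvious replacement.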

Now, in my opinion, coming from a molecular biology upbringing, associations based upon eQTL studies and proximity still require experimental validation before you can be sure any particular transcription factor binding site is functional. Consider knocking down/out the transcription factor or overexpressing the transcription factor and assaying for expression.

ADD COMMENT
0

Thanks for the suggestions. I agree that GWAS/eQTL+ChIP-seq may be a powerful combination of approaches in the future, though I'm afraid that the GRAIL approach may not give direct enough links between TFs and their targets.

ADD REPLY
