CENTIPEDE DNase-seq analysis - ylist question
7.2 years ago
rbronste ▴ 420

Hi All,

I am trying to use CENTIPEDE to infer TF footprints. I made a PWM using bwtool from one MEME TF matrix and my bigWig files from the DNase experiment, identifying insertion events +/- 200 bp around the center of the motif. At this stage I am a little confused by the CENTIPEDE instructions, specifically the example from their R help page on fitting the model:

centFit <- fitCentipede(Xlist = list(as.matrix(NRSFcuts)), Y=cbind(rep(1, dim(NRSF_Anno)[1]), NRSF_Anno[,5], NRSF_Anno[,6]))

So the Xlist is the matrix that I have created, but I am not sure exactly what the Y argument (the "Ylist") is supposed to be. If anyone can help, thanks in advance!


centipede DNase-seq R • 2.2k views
7.2 years ago
ddiez ★ 2.0k

CENTIPEDE integrates experimental evidence with prior information to determine whether a particular genome location is bound by some transcription factor (or other DNA-binding protein). The X matrix holds the experimental evidence, for example the cuts inferred from DNaseI-seq. The Y matrix holds the prior information: how well the region matches the TF binding site (the score from matching the TF's PWM to that position) and the conservation of that genomic position (obtained, e.g., from phastCons scores). I think that is what NRSF_Anno[, 5] and NRSF_Anno[, 6] represent. I remember the documentation being a bit confusing, but I don't have it with me at the moment to check this in more detail.


I took a look at the package; here are the first rows of NRSF_Anno:

  chrom hg18Start hg18End Strand PWMscore ConsScore TSSdist
1  chr1     90336   90356      - 16.69222   0.03875   31393
2  chr1    141061  141081      + 19.73801   0.18760   82118
3  chr1    236650  236670      - 16.69222   0.02165  120861
4  chr1    398305  398325      + 16.69222   0.29235   40794
5  chr1    571868  571888      - 16.69222   0.10220   40019
6  chr1    676751  676771      + 19.73801   0.05410   64864

As you can see, NRSF_Anno[, 5] is the PWMscore and NRSF_Anno[, 6] is the ConsScore (conservation score). In their paper the authors also used the distance to TSS (TSSdist) in the model.
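To make the mapping concrete, here is a minimal sketch of building Y from such an annotation table by column name. The data frame `anno` is a stand-in that just copies the six rows printed above; the column labels given to Y (`Intercept`, `PWMscore`, `ConsScore`) are my own choice for readability and are not something the package is known to require.

```r
# Toy annotation table mirroring the first rows of NRSF_Anno shown above.
anno <- data.frame(
  chrom     = rep("chr1", 6),
  hg18Start = c(90336, 141061, 236650, 398305, 571868, 676751),
  hg18End   = c(90356, 141081, 236670, 398325, 571888, 676771),
  Strand    = c("-", "+", "-", "+", "-", "+"),
  PWMscore  = c(16.69222, 19.73801, 16.69222, 16.69222, 16.69222, 19.73801),
  ConsScore = c(0.03875, 0.18760, 0.02165, 0.29235, 0.10220, 0.05410),
  TSSdist   = c(31393, 82118, 120861, 40794, 40019, 64864)
)

# Y has one row per candidate site: an intercept column of 1s plus the priors.
# The scalar 1 is recycled to the number of sites by cbind().
Y <- cbind(Intercept = 1, PWMscore = anno$PWMscore, ConsScore = anno$ConsScore)

# With the real data, Y would be passed on as in the package example:
# centFit <- fitCentipede(Xlist = list(as.matrix(NRSFcuts)), Y = Y)
```

Apart from the column labels, this is the same matrix as `cbind(rep(1, dim(anno)[1]), anno[, 5], anno[, 6])`. The rows of Y must be in the same order as the rows of the cut-count matrix in Xlist.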


A useful source of information on CENTIPEDE usage might be this tutorial on GitHub.


This is off-topic, but it might be useful for others interested in this package. It seems CENTIPEDE can no longer be installed in R 3.3.2:

install.packages("CENTIPEDE", repos="http://R-Forge.R-project.org", type = "source") 
Warning in install.packages :
  package ‘CENTIPEDE’ is not available (for R version 3.3.2)

I solved this by downloading the software from the SVN repository (from here) and creating an empty file named NAMESPACE in the root of the package. Then the package can be installed properly.
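The workaround can also be scripted from within R. This is only a sketch: `pkg_dir` is a hypothetical path, and a temporary directory stands in for the real SVN checkout so the snippet runs anywhere.

```r
# Hypothetical location of the CENTIPEDE source directory from SVN.
# A temporary directory is used here as a placeholder for the real checkout.
pkg_dir <- file.path(tempdir(), "CENTIPEDE")
dir.create(pkg_dir, showWarnings = FALSE)

# Create an empty NAMESPACE file in the package root if it is missing.
ns_file <- file.path(pkg_dir, "NAMESPACE")
if (!file.exists(ns_file)) file.create(ns_file)

# Then install from the local source directory (uncomment with a real checkout):
# install.packages(pkg_dir, repos = NULL, type = "source")
```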


Thanks very much for the information. However, I guess what I am confused by is where NRSF_Anno even comes from. I obtained a single matrix from bwtool that shows the coverage in my DNase bigWigs and how it is distributed around the center of each motif in the MEME output. So I am just trying to figure out what to build the annotation file from. Thanks again.

Rob. (I was going by the following instructions pulled from a paper. I do have the phyloP bigWig but am not sure how to integrate it.)

These count matrices were then used by CENTIPEDE along with conservation levels at corresponding positions (phyloP score from the placental subset of the UCSC 60-way genome alignment; Karolchik et al., 2014) to learn motif-specific models of Tn5 insertion density and predict the likelihood that each motif instance across the genome is bound. We used sites predicted with greater than 95% posterior probability to be occupied as our footprint set.
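The 95% cutoff in the quoted passage amounts to a simple filter on the per-site posterior probabilities. The sketch below assumes the fitted object exposes them as `centFit$PostPr` (to my knowledge that is the element name, but it is worth confirming with `names(centFit)`); the vector `post_pr` is made-up data standing in for it.

```r
# Made-up posterior probabilities, one per candidate motif site.
# With a real fit this would be something like: post_pr <- centFit$PostPr
post_pr <- c(0.99, 0.42, 0.97, 0.10, 0.951, 0.95)

# Keep sites predicted to be bound with posterior probability greater than 95%.
footprints <- which(post_pr > 0.95)
footprints
```

The resulting indices refer to rows of the site table, so they can be used to subset the motif coordinates directly.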


The following is from the Genome Research paper describing CENTIPEDE and explains how they arrive at the NRSF_Anno file. I am just not sure how to pull the data out once I have the PWM, that is, the MEME motif positions in mm10 within my open chromatin data:

For each candidate, we extracted genomic information that would be included in the model prior: sequence conservation (Pollard et al. 2010); quality of the PWM match; and distance to the nearest transcription start site; as well as experimental data in a 200–400-bp window around the site to be used in the likelihood—DNase I sensitivity and ChIP-seq data on seven histone modifications, all from LCLs.


The idea behind CENTIPEDE, as I understand it, is that you start with the predicted locations of the DNA binding sites for one or more TFs. From those locations you then obtain both the X and the Y matrix. For the X matrix you use the experimental evidence from DNase-seq (or ATAC-seq, or even histone marks). For the Y matrix you use the prior information. For example, the PWMscore associated with a binding location is the score you obtained from MEME. For the conservation you have one value per nucleotide; what I did (if I recall correctly) was to compute the mean conservation over that location. For the distance, you would compute the distance from the site to every TSS on the chromosome and keep the smallest one, and so on.
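A minimal sketch of those two computations in base R, on toy data. Everything here is hypothetical: `sites`, `cons`, and `tss` are invented inputs for a single chromosome, and in practice the per-base scores would be read from the phyloP or phastCons bigWig with whatever tool you prefer. Measuring distance from the site midpoint is also my own assumption.

```r
# Toy inputs (all hypothetical): candidate motif sites on one chromosome,
# a per-base conservation track, and TSS coordinates on that chromosome.
sites <- data.frame(start = c(11, 41, 71), end = c(20, 50, 80),
                    PWMscore = c(16.7, 19.7, 12.3))
cons  <- rep(c(0.1, 0.9, 0.5), times = c(30, 30, 40))  # cons[i] = score at base i
tss   <- c(5, 60, 100)

# Mean conservation over the bases covered by each site.
sites$ConsScore <- mapply(function(s, e) mean(cons[s:e]), sites$start, sites$end)

# Distance from each site's midpoint to the closest TSS.
mid <- (sites$start + sites$end) / 2
sites$TSSdist <- sapply(mid, function(m) min(abs(tss - m)))

# Prior matrix: intercept, PWM score, conservation, and distance to TSS.
Y <- cbind(1, sites$PWMscore, sites$ConsScore, sites$TSSdist)
```

Whether to include the TSS distance column is a modelling choice; the package example quoted in the question uses only the intercept, PWM score, and conservation score.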

