I want to raise a point about ChIP-seq analysis with ngs.plot. You will probably need to be familiar with this tool to answer my concern (see https://github.com/shenlab-sinai/ngsplot). The concern is mostly statistical in nature.
I used ngs.plot to compare a factor's occupancy over the gene bodies of different gene families. My prior hypothesis was that this factor would show higher occupancy on one class of genes than on another. In my first analysis the result was exactly the opposite of what I expected, with a large difference between the two gene classes. When I repeated the analysis some time later, I was surprised to see the result completely reversed, now in favor of my hypothesis. The only difference was that the first analysis used a gene list downloaded from HUGO, while the second used a list from Ensembl. So I asked why, especially since the same gene symbols were in both lists. The reason turned out to be the length of the gene list: the Ensembl list had roughly 30 times more entries than the HUGO list (even though both are of the same type), and the shorter the list, the more enriched it appears. So clearly, whether a class of genes shows enrichment is a function of the number of entries in the list. To test this further, I truncated the list for another class of genes, and that changed the outcome markedly. (Note that I'm comparing ChIP to input control.)
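Before plotting, the two lists can be compared directly to see how far their sizes diverge. The following is only a minimal Python sketch: the function name `summarize_gene_lists`, the toy gene symbols, and the assumption that each list is a plain sequence of symbols are illustrative and not part of ngs.plot.

```python
from collections import Counter

def summarize_gene_lists(list_a, list_b):
    """Report entry counts, unique-symbol counts and overlap for two gene lists.

    Hypothetical helper for illustration; not an ngs.plot function.
    """
    counts_a, counts_b = Counter(list_a), Counter(list_b)
    shared = set(counts_a) & set(counts_b)
    return {
        "entries_a": len(list_a),        # total lines in list A
        "entries_b": len(list_b),        # total lines in list B
        "unique_a": len(counts_a),       # distinct symbols in list A
        "unique_b": len(counts_b),       # distinct symbols in list B
        "shared_symbols": len(shared),   # symbols present in both lists
        "entry_ratio_b_over_a": len(list_b) / len(list_a) if list_a else float("nan"),
    }

# Toy data only: made-up symbols standing in for a short and a long list.
short_list = ["GENE1", "GENE2", "GENE3"]
long_list = ["GENE1", "GENE1", "GENE2", "GENE2", "GENE3", "GENE4"]

summary = summarize_gene_lists(short_list, long_list)
print(summary)
```

A large gap between the entry count and the unique-symbol count of one list would show that the same symbols are repeated in it, which is worth knowing before comparing the resulting plots.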
Now this is problematic. Take, for example, this paper: http://www.biomedcentral.com/1471-2164/15/284 . The legend of Fig. 4 states that the heatmaps for different functional elements were resized and compared, which suggests that the number of genes used differed between them. So although the difference may look biologically meaningful, it may in fact be simply an artifact of the analysis. The ngs.plot documentation also states that different gene lists can be compared (for example, lists based on gene expression, to investigate how epigenomes correlate with transcription). Again, we may see differences simply because of the number of items in each gene list.
I thought that using lists of equal length could be a solution. That would mean trimming the longer lists down to the size of the shortest one. Genes would be removed on the basis of enrichment: those with the lowest enrichment would be dropped first. Do you think this is a good solution?
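To make the proposal concrete, here is a minimal Python sketch of the trimming step. It assumes a per-gene enrichment score (for example ChIP over input) has already been computed outside ngs.plot; the function name `trim_to_equal_length`, the input layout and the toy scores are hypothetical.

```python
def trim_to_equal_length(enrichment_by_class):
    """Trim every gene class to the size of the smallest class.

    enrichment_by_class: dict mapping class name -> {gene: enrichment score}.
    Within each class, genes with the lowest enrichment are dropped first.
    Hypothetical helper for illustration; not an ngs.plot function.
    """
    # Size of the shortest list sets the target length for all classes.
    target = min(len(genes) for genes in enrichment_by_class.values())
    trimmed = {}
    for name, genes in enrichment_by_class.items():
        # Rank genes from highest to lowest enrichment, keep the top `target`.
        ranked = sorted(genes, key=genes.get, reverse=True)
        trimmed[name] = ranked[:target]
    return trimmed

# Toy scores only: made-up genes and values.
toy_scores = {
    "class_a": {"A1": 2.0, "A2": 0.5, "A3": 1.2, "A4": 0.1},
    "class_b": {"B1": 0.9, "B2": 1.5},
}

trimmed = trim_to_equal_length(toy_scores)
print(trimmed)
# {'class_a': ['A1', 'A3'], 'class_b': ['B2', 'B1']}
```

Each trimmed list could then be written to a text file, one gene per line, and used as the gene list for the corresponding plot.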