I want to raise an important point here concerning ChIPseq analysis with ngs.plot. I think you'll need to know about this utility to answer my concern (see https://github.com/shenlab-sinai/ngsplot). It is of rather statistical nature.
I leveraged ngs.plot to compare factor's occupancy on gene body of different gene families. I had a prior hypothesis in mind that this factor would have bigger genome occupancy on one class of genes vs another. When I did the analysis initially the result was exactly opposite to what I expected with big difference between two gene classes. After a while I repeated the analysis and was surprised to see the data in favor of my hypothesis (completely reversed). The only difference was that in the first analysis I used a list downloaded from HUGO while in second from ensemble. So I asked why, especially that these same gene symbols were in both list. The reason turned out to be due to length of gene list - there was approx. ~30 times more gene items in ensembl list than in HUGO list (even though they are the same type): the shorter the list the more enriched it appears. So clearly whether there will be an enrichment for a class of genes is a function of a number of gene items in the list. To further test that I truncated another class of genes and it significantly changed the outcome. (Note that I'm comparing ChIP to input control).
Now this is problematic. For example in this paper http://www.biomedcentral.com/1471-2164/15/284 . If you read figure legend of fig.4 you'll see that it states that heatmaps for different functional elements were resized and compared which suggests the number of genes used was different. And so although the difference may seem biologically meaningful it is in fact simply an artifact of the analysis. Also in the ngs.plot documentation they state that we can compare different gene lists (for example based on genes expression investigate how epigenomes correlate with transcription). Again we may see differences simply due to the number of items in gene list.
I thought using lists of equal lengths could be a solution. Obviously that means reducing the long list to the number of shortest list. The genes would be removed on the basis of enrichment - those with the lowest enrichment would be removed. Do you think it is a good solution?
I don't know the implementation of NGS.Plot. But the profiles shown are normally averages, maybe with some indication of variance and the size of a list should not matter.
I can think of 2 reasons for the size of a list changing the magnitude:
a) very very few genes being sensitive to outliers with very strong or very little enrichment
b) in ensembl compared to HUGO you have many transcripts for the same gene* all with more or less the same coverage. This makes some genes with many transcripts have more weight. Adding or removing these might impact the profile strongly.
* I don't know if this is really the case for you, since you did not show any data. But I could not explain 30x difference for a gene symbol list differently.
You cannot do any kind of Bioinformatic analysis with the expectation that the tool will do the right thing for you, or even knows what it is you're trying to do. I think 'Fundamental Flaw in ngs.plot' is a little strong - i've never used the tool, but it sounds like its doing exactly what it says it does which is calculating total signal distribution over genomic regions :)
Perhaps you should take a step back and redefine your question. If your question is "what is the occupancy between two samples for HUGO genes", then accept your null hypothesis. If it's "occupancy for Ensembl genes", accept your hypothesis. If this situation seems unsatisfactory to you, think of a better hypothesis :)
I suspect that your looking for a hypothesis along the lines of "for genes *expressed in my cells*, there is a difference in occupancy". This is a very different question than 'all genes' since you first have to curate your list down to just the genes actually being expressed (and thus could have your transcription factor). Thus, its not the length of the gene list per se, but rather the fact that the extra genes in the Ensembl annotation are more likely to be rare, cell type specific genes, and most will not targets for your ChIP. These extra genes will dilute out your signal - unless you only select the genes actually being transcribed.
How to find out whats being transcribed? Use RNA-Seq, or only use genes which overlap with TF peaks for both samples (or each one, and accept that the different lists is an OK thing)
I like your suggestion of using genes that are actually expressed or in close proximity to the peaks and I should have considered that. As you guessed refining a hypothesis on the basis of whether it was downloaded from ensembl or HUGO is rather unsatisfactory :) Actually I decided to use ensembl list because it was much more richer and was taken aback by the completely different result even though the same class of genes was used - bigger gene list diluted the overall signal. It is not that I want the analysis to show what I want it to show but I'm simply being wary and don't want to mislead a potential paper reader without me even noticing it.
Having said that I do think that if one wants to compare different gene families for factor's occupancy on gene body the number of genes in the list ought to be the same. Why? Because overall coverage depends on (is function of) number of items on the gene list ( see point 8 of implementation here https://github.com/shenlab-sinai/ngsplot/wiki/HowAreYaxisValsCalculated ) whether the genes are expressed or not. Hence if the question is whether genome occupancy for a gene class vs another has to be answered then gene list length to be compared must be equal.
Let me illustrate: say we are comparing class A to class B. Class A has 5 genes while class B has 50 genes and all these genes are in proximity to peaks so can be included in the analysis. Now I assume for each gene in class A there are 10 reads aligning and hence the average profile will be 10. While in class B the first 10 genes have 20 reads while none reads in rest and hence average profile will be 4. The final conclusion is that factor is occupying class A more significantly than class B. But is this true? You can see that the top 10 genes of class B had bigger occupancy than combined mean coverage of class A. For me this is problematic.
The way I want to do it then is as follows:
1) get profiles for class B and order from highest to lowest enrichemnt
2) take top five genes from class B and repeat the procedure in parallel to A (which also has five members) to compare the two groups.