Hi all!
I am trying to compare my ChIP-seq replicates. I have one input sample (IgG), one background (H3), and three modifications with two replicates each: H3K27ac (rep1, rep2), H3K4me1 (rep1, rep2), and H3K27me3 (rep1, rep2). These are biological replicates.
I wish to compute the correlation between replicates. To do so, I started using deepTools plotCorrelation, which uses a compressed NumPy file generated by multiBamSummary with the following command:
multiBamSummary bins -b *sorted.bam -o bowtie_readCounts.npz --outRawCounts bowtie_readCounts.tab
The correlation is calculated by dividing the genome into bins (say 10 kb) and then counting the number of reads falling in each bin. I have the following queries:
1. Before calculating the correlation, how can I tell whether my data (sequencing data here) is normally distributed? Pearson correlation is meant for data that has a normal distribution. Is ChIP-seq data not normally distributed because read enrichment occurs in a few specific regions?
The deepTools description here says:
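One rough way to look at the normality question is to compute the skewness of the per-bin counts for each sample. The sketch below is only an illustration: `read_count_columns` and `skewness` are hypothetical helper names, and the parser assumes the --outRawCounts table is tab-delimited with a single header line, chromosome/start/end in the first three columns, and one count column per BAM file after that. A skewness near 0 is consistent with a symmetric distribution, while a large positive value indicates a long right tail (a few bins with very high counts).

```python
import csv


def read_count_columns(path):
    """Return {column_name: [counts]} from a tab-delimited counts table.

    Assumption: one header line, then chrom/start/end in the first three
    columns and one count column per BAM file after that.
    """
    with open(path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        header = next(rows)
        # Drop a leading '#' and any quote characters around the names.
        names = [field.strip("#'") for field in header[3:]]
        columns = {name: [] for name in names}
        for row in rows:
            for name, value in zip(names, row[3:]):
                columns[name].append(float(value))
    return columns


def skewness(values):
    """Moment-based skewness: about 0 if symmetric, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5
```

With the file from the command above, something like `for name, counts in read_count_columns("bowtie_readCounts.tab").items(): print(name, skewness(counts))` would print one value per sample. This is a quick diagnostic, not a formal normality test.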
Pearson is an appropriate measure for data that follows a normal distribution, while Spearman does not make this assumption and is generally less driven by outliers, but with the caveat of also being less sensitive.
So should I use Pearson for my analysis? The Galaxy tutorial here also uses Pearson for this purpose. However, a post here mentions:
Make sure you use --corMethod spearman for the plot though. Using Pearson's for this would be a crime against statistics since the signal is not even close to being either normally distributed or linear
and a further explanation by John is too difficult for me to understand. Can someone explain the rationale for using one of the two methods in easy, comprehensible language?
Also, when I plot these two graphs, Spearman shows lower correlation between replicates, as shown below.
Pearson, however, shows that my input sample is highly correlated with all the test samples (histone modifications).
I used the following command to plot these graphs:
plotCorrelation -in bowtie_readCounts.npz -c pearson --skipZeros --plotTitle "Pearson Correlation of All Replicates and Input DNA" --plotNumbers --removeOutliers --whatToPlot heatmap -o Heatmap_All_Pearson_corr.png --outFileCorMatrix Heat_Pearson_Corr_matrix.tab
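As a small illustration of how the two measures can disagree on count data, the sketch below uses purely synthetic numbers (not the samples from this post). Two "samples" have independent background counts in almost all bins but share a handful of bins with extremely high counts. The function names are hypothetical helpers written for this example; Spearman is computed here as the Pearson correlation of the ranks, with ties given their average rank.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


rng = random.Random(0)
n_bins = 5000
# Independent low background counts in every bin (synthetic data).
sample_a = [float(rng.randint(0, 10)) for _ in range(n_bins)]
sample_b = [float(rng.randint(0, 10)) for _ in range(n_bins)]
# Five bins with extreme counts in both samples, e.g. a shared artifact.
for i in range(5):
    sample_a[i] = 5000.0
    sample_b[i] = 5000.0

print("Pearson :", round(pearson(sample_a, sample_b), 3))
print("Spearman:", round(spearman(sample_a, sample_b), 3))
```

In this toy setup Pearson comes out close to 1 because the few shared extreme bins dominate the calculation, while Spearman stays close to 0 because the ranks of the remaining bins are unrelated. Whether something similar explains the input correlating highly with every histone mark here is an assumption that would need checking against the actual counts.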
If ChIP-seq data is compositional, then neither, since it'll perform worse than random. See https://www.nature.com/articles/s41592-019-0372-4