I am trying to compare my ChIP-seq replicates. I have one input sample (IgG), one background (H3), and three modifications, each with two biological replicates: H3K27ac (rep1, rep2), H3K4me1 (rep1, rep2), and H3K27me3 (rep1, rep2).
I want to compute the correlation between replicates. To do so, I started with deepTools plotCorrelation, which uses a compressed NumPy file generated by multiBamSummary with the following command:
multiBamSummary bins -b *sorted.bam -o bowtie_readCounts.npz --outRawCounts bowtie_readCounts.tab
The correlation is calculated by dividing the genome into bins (say 10 kb) and counting the number of reads falling in each bin. I have the following queries:

1. Before calculating the correlation, how do I know whether my data (sequencing data here) is normally distributed or not? Pearson correlation assumes the data follow a normal distribution. Is ChIP-seq data non-normal because read enrichment occurs only in a few specific regions?
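As a quick sanity check on that normality question, one can look at the skewness of the per-bin counts. The sketch below uses synthetic counts drawn from a negative binomial (a common model for ChIP-seq coverage: many near-zero bins plus a long right tail from enriched regions) rather than the real bowtie_readCounts.tab, so the exact numbers are illustrative only; the same check works on a column of the --outRawCounts table.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one sample's per-bin read counts.
# Real data: load a numeric column from bowtie_readCounts.tab instead.
rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=2, p=0.05, size=10000)

# A normal distribution has skewness ~0; a large positive value
# indicates the long right tail that violates Pearson's assumption.
skew = stats.skew(counts)
print(f"skewness: {skew:.2f}")
```

A histogram of the counts (e.g. with matplotlib) makes the same point visually: the bulk of bins sit near zero with a heavy right tail, nothing like a bell curve.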
- The deepTools documentation here says:
Pearson is an appropriate measure for data that follows a normal distribution, while Spearman does not make this assumption and is generally less driven by outliers, but with the caveat of also being less sensitive.
Make sure you use --corMethod spearman for the plot though. Using Pearson's for this would be a crime against statistics since the signal is not even close to being either normally distributed or linear
and a further explanation by John is too difficult for me to understand. Could someone explain the rationale for choosing one of the two methods in simple, comprehensible language?
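My current understanding of the "less driven by outliers" point, in code: Spearman correlates ranks rather than raw values, so one extreme shared bin barely moves it, while Pearson can be dominated by that single point. This is a toy illustration with made-up data, not my actual samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two "replicates" that agree only weakly on most bins...
x = rng.normal(size=200)
y = x + rng.normal(scale=2.0, size=200)

# ...plus one shared extreme bin (think of a high-signal artifact region).
x = np.append(x, 100.0)
y = np.append(y, 100.0)

# Pearson is pulled toward 1 by the single outlier;
# Spearman sees only one extra rank and stays moderate.
print(f"Pearson:  {stats.pearsonr(x, y).statistic:.2f}")
print(f"Spearman: {stats.spearmanr(x, y).statistic:.2f}")
```

If that intuition is right, it would explain the quoted advice: ChIP-seq bin counts are full of exactly such extreme bins, so Pearson mostly measures the outliers.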
I used the following command to plot these graphs:
plotCorrelation -in bowtie_readCounts.npz -c pearson --skipZeros --plotTitle "Pearson Correlation of All Replicates and Input DNA" --plotNumbers --removeOutliers --whatToPlot heatmap -o Heatmap_All_Pearson_corr.png --outFileCorMatrix Heat_Pearson_Corr_matrix.tab