Question

Which one to use for ChipSeq replicate comparison: Spearman or Pearson

1

Entering edit mode

5.7 years ago

rohitsatyam102 ▴ 940

Hi all!

I was trying to compare my ChipSeq replicates. I have one Input sample (IGG), one background (H3) and Three modifications each having two replicates; H3K27ac (rep1,rep2), H3K4me1 (rep1,rep2), and H3K27me3 (rep1,rep2). These are biological replicates.

I wish to compute the correlation between replicates. To do so I started using Deeptools plotcorrelation that uses a compressed numpy file generated by multibamsummary using the following command.

multiBamSummary bins -b *sorted.bam -o bowtie_readCounts.npz --outRawCounts bowtie_readCounts.tab

Most of the correlation is calculated by dividing the genome into bins (say 10K) and then counting no.s of reads falling in those bins. I have following queries 1. Before calculating correlation, how to know if your data (sequencing data here) is normally distributed or not because Pearson correlation is checked for data having a normal distribution. Is chip seq data not normally distributed because we have read enrichment occurring in a few specific regions?

The description of deeptools here say

Pearson is an appropriate measure for data that follows a normal distribution, while Spearman does not make this assumption and is generally less driven by outliers, but with the caveat of also being less sensitive.

So should I use Pearson for my analysis? The galaxy tutorial here also uses Pearson for this purpose. However, post here mentions

Make sure you use --corMethod spearman for the plot though. Using Pearson's for this would be a crime against statistics since the signal is not even close to being either normally distributed or linear

and a further explanation by John is too difficult for me to understand. Can someone explain to me the rationale of using one of the two methods in an easy and comprehensible language.

Also when I plot these two graphs, Spearman shows me less correlation between replicates as shown below.

The Pearson however shows my Input sample is highly correlated with all the test samples (Histone modifications).

I used the following syntax to plot these graphs

plotCorrelation -in bowtie_readCounts.npz -c pearson --skipZeros --plotTitle "Pearson Correlation of All Replicates and Input DNA" --plotNumbers --removeOutliers --whatToPlot heatmap -o Heatmap_All_Pearson_corr.png --outFileCorMatrix Heat_Pearson_Corr_matrix.tab

ChIP-Seq correlation replicates genome • 5.2k views

ADD COMMENT • link updated 5.7 years ago by colin.kern ★ 1.1k • written 5.7 years ago by rohitsatyam102 ▴ 940

1

Entering edit mode

If chip-seq is compositional, then neither, since it'll perform worse than random. See https://www.nature.com/articles/s41592-019-0372-4

ADD REPLY • link 5.7 years ago by jamietmorton ▴ 10

score 2 · Answer 1 · 2019-10-28

2

Entering edit mode

5.7 years ago

colin.kern ★ 1.1k

ChIP-seq data is definitely not normally distributed. The issue with spearman correlation, though, is that if a lot of your bins have low variance across samples, which is a given in ChIP-seq because most of your bins are going to be background regions rather than regions with signal, then because spearman is based on ranking there can be huge differences in the ranking of these background regions just due to variations in noise in the data. The --skipZeros argument isn't going to remove a lot of these background regions because that assumes your data is not noisy. I'm also not sure about the --removeOutliers argument. If that's identifying outliers using the standard method for normally distributed data, then it could be removing many of your important regulatory regions as outliers.

One thing you can try is to use bamCompare to create bigwig tracks with the IP signal normalized by the input, then do plotCorrelation based on those to see if the clustering you expect appears.

You can also try plotPCA, which by default only uses 1000 bins with the highest variance between samples. This might help cluster your libraries based on actual biologically meaningful differences instead of being dominated by variance in background/noise.

Finally, you can use a peak caller like Macs2 to get peak regions, then create a distance matrix based on overlap of peak regions using bedtools jaccard and do a heatmap/clustering based on that.

ADD COMMENT • link 5.7 years ago by colin.kern ★ 1.1k

2

Entering edit mode

Very nice answer. What you can also do is to select multiBamSummary BED-file --BED selection.bed --bamfiles file1.bam file2.bam -o results.npz and use a BED-file if you have already genomic regions of interest.

ADD REPLY • link 5.7 years ago by sim.j.baum ▴ 170

0

Entering edit mode

Thanks for your reply. The skipZeros dis-consider the bins containing zero reads in the bams that are being compared. If not invoked, these regions will otherwise show their own correlation.

--skipZeros By setting this option, genomic regions that have zero or missing (nan) values in all samples are excluded.

--removeOutliers If set, bins with very large counts are removed. Bins with abnormally high reads counts artificially increase pearson correlation; that’s why, multiBamSummary tries to remove outliers using the median absolute deviation (MAD) method applying a threshold of 200 to only consider extremely large deviations from the median.

I created Bigwig files using BPM (Bins Per million) read normalization which is equivalent to transcripts per million and is better method of normalization than FPKM or RPKM as explained here. However I see no improvement in the correlation plot.

ADD REPLY • link 5.7 years ago by rohitsatyam102 ▴ 940

1

Entering edit mode

Yes, skipZeros removes bins with values of zero in all samples, so it will remove bins in unmappable regions, but if you have just one read in one sample, then that bin is included, so it doesn't help to remove any of your background noise.

TPM's benefit over FPKM/RPKM is well documented in RNA-seq experiments, but because these bigwig files are generated using uniformly sized bins, I'm not sure if there's any meaningful difference between TPM and FPKM. The only difference between the methods is the order in which the two normalizations are done, one based on library size and the other based on gene length. But because your "gene lengths", or in this case bins, are all the same size, it should be equivalent to not normalizing by gene size at all, and in that case there'd be no difference between TPM and FPKM.

ADD REPLY • link 5.7 years ago by colin.kern ★ 1.1k