Choice of normalization method for spike-in data
7 weeks ago
eashby47 ▴ 10

Hello,

I am seeking some advice on how to pair a normalization method with spike-in yeast DNA in a time course CUT&RUN experiment with 5 timepoints and 3 replicates per time point. I am using MACS3 to perform peak calls and DiffBind to test for differential binding (using the DESeq2 workflow) between the 0 hr timepoint and subsequent timepoints.

I have tried (a) normalizing tag counts according to spike-in library size and (b) normalizing according to spike-in relative log expression (RLE). In Figure 32 on page 55 of the DiffBind vignette, these two normalization methods produce nearly identical results.

However, in my case, I am getting different results. Namely, one time point always produces 0 differentially bound sites, and the choice of normalization method determines at which time point the 0-result occurs:

When I normalize by spike-in library size, I obtain the following results table (Group refers to the time point of comparison, Group2 refers to the 0 hr timepoint, and DB.DESeq2 shows the number of differentially bound sites at FDR < 0.05 for that contrast):

[Image: results table with spike-in library size normalization]

But when I normalize by spike-in RLE, I obtain the following results table:

[Image: results table with spike-in RLE normalization]

When I look at the normalized tag count profiles for the top 5 sites identified under each normalization method, the profiles display different patterns, which likely accounts for the differences in the dba results.

[Image: normalized tag count profiles for the top 5 sites under each normalization method]

Shown below are my spike-in library sizes per replicate (colored by timepoint). Note that the 0 hr and 24 hr groups have the most variability in the spike-in library size between replicates.

[Image: spike-in library sizes per replicate, colored by timepoint]

I have two questions that I would like feedback on:

  1. How do we choose an appropriate normalization method to pair with spike-in data? I understand that library size vs. RLE normalization methods have different underlying assumptions, but I don't know how to assess whether these assumptions are reasonable in the spike-in data. For example, an assumption of RLE normalization is balanced differential expression; this doesn't make any sense to me in the context of a spike-in.
  2. Why are my results so different between these two normalization methods? Is the variability in spike-in library sizes driving a wedge between these normalization methods somehow?
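To make question 2 concrete, here is a toy sketch of the two size-factor computations (Python, with entirely invented spike-in counts; the real computations happen inside DiffBind/DESeq2 in R). Library-size factors use only each sample's spike-in total, while RLE (median-of-ratios) compares each sample to a per-region geometric-mean reference, so uneven spike-in totals and composition can pull the two methods apart:

```python
import math
from statistics import median

# Hypothetical spike-in counts: 3 spike-in regions (rows) x 4 samples (columns).
# Sample totals vary a lot between "replicates", as in the 0 hr / 24 hr groups.
counts = [
    [100, 120,  60, 200],
    [ 50,  55,  30, 100],
    [200, 210, 120, 390],
]
n_samples = len(counts[0])

# (a) Library-size factors: each sample's spike-in total over the mean total.
totals = [sum(row[j] for row in counts) for j in range(n_samples)]
mean_total = sum(totals) / n_samples
lib_factors = [t / mean_total for t in totals]

# (b) RLE (median-of-ratios, the DESeq2 approach): per-region geometric mean
# as a reference, then the median count/reference ratio per sample.
geo_means = [math.exp(sum(math.log(c) for c in row) / n_samples) for row in counts]
rle_factors = [
    median(row[j] / g for row, g in zip(counts, geo_means))
    for j in range(n_samples)
]
```

Even with these invented counts the two factor sets disagree noticeably for some samples; large between-replicate variability in the spike-in totals would widen that gap.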

I would appreciate any advice! Thanks! -Ethan

7 weeks ago
ATpoint 53k
tl;dr

I have always been sceptical towards the use of spike-ins in any NGS experiment, because what you do is add a constant amount of DNA to the library. That tells you nothing about data quality, which can differ a lot between libraries, and if it does, normalizing to a constant spike-in makes little sense. Imagine library1 is very good and 30% of its reads overlap peaks, while library2 is crap and only 1% of its reads overlap peaks. Scaling both to a constant spike-in does not respect that difference in signal-to-noise ratio. I do not see any reason to use spike-ins unless you expect a global alteration of the chromatin profile; in that case spike-ins would be the only option, but it would then be questionable whether an NGS assay is the experiment of choice at all. Others may have different opinions here.

CUT&RUN/TAG is not fundamentally different (in my understanding) from other enrichment-based DNA-seq assays such as ChIP-seq or ATAC-seq, hence the same normalization methods apply. I personally like to use the method from DESeq2 (often called RLE) followed by MA-plots to inspect the normalization. The easiest approach is to run the standard DESeq2 pipeline and then, from the results, plot baseMean (on log2 scale) versus fold change, e.g. using the plotMA function from DESeq2 or any custom implementation. The bulk of points, including the "arrowhead", should be somewhat centered along y ~ 0. If that is the case then for me this indicates a good normalization. If not, you might need to tweak the normalization, e.g. by using only a subset of peaks for it. I often use only the top 5000 peaks with the highest baseMean (the average counts across conditions), as these regions have high counts in both conditions and as such are likely (in my head) not to change, making them the most reliable for normalization. After all, a common assumption (e.g. in most differential analysis frameworks) is that most regions do not change between conditions, or (as Michael Love, the DESeq2 developer, phrases it) that "the median ratio captures the size relationship". This is what MA-plots visualize.
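To make the median-of-ratios idea and the MA-centering check concrete outside R, here is a toy Python sketch with invented counts (DESeq2's real implementation is more general, handling many samples and dispersion estimation). Sample 2 is "sequenced" at 2x depth and a minority of peaks are truly 4x up:

```python
import math
from statistics import median

# Toy peak counts for two samples: sample 2 has 2x depth, and the first 3
# peaks are additionally 4x up. All values are invented.
base = [500, 300, 200, 150, 120, 100, 80, 60, 50, 40]
s1 = base[:]
s2 = [b * 2 * (4 if i < 3 else 1) for i, b in enumerate(base)]

def rle_factors(a, b):
    """Two-sample median-of-ratios size factors (the RLE idea in DESeq2)."""
    geo = [math.sqrt(x * y) for x, y in zip(a, b)]
    f1 = median(x / g for x, g in zip(a, geo))
    f2 = median(y / g for y, g in zip(b, geo))
    return f1, f2

f1, f2 = rle_factors(s1, s2)

# MA values after normalization: M = log2 fold change, A = mean log2 signal.
m = [math.log2((y / f2) / (x / f1)) for x, y in zip(s1, s2)]
a = [0.5 * (math.log2(x / f1) + math.log2(y / f2)) for x, y in zip(s1, s2)]
```

Here the unchanged peaks land at M = 0 and the three differential peaks at M = 2, which is the "centered along y ~ 0" pattern to look for in plotMA.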

I do not have an impressive set of example data at hand right now to illustrate the choice of regions for normalization, but the plot below should reflect the principle. It shows two ATAC-seq samples with a lot of differential regions, normalized with either all regions or only the top 5000, and I think the top 5000 does a slightly better job, as it better centers the "arrowhead" on the far left at y = 0. As said, this is just an illustration and the difference is not extreme here, but if a lot of change is going on in your data, then the choice of regions may notably influence the outcome.

[Image: MA plots of two ATAC-seq samples normalized on all regions vs. only the top 5000 regions]

Does that make sense to you? Feel free to add some of these plots in case you need feedback. This is just my personal go-to strategy; others may have different opinions. In any case, it is a data-driven way of doing it: these plots let you actually "look at" your data and decide whether normalization went well, rather than blindly trusting whichever black-box method you choose.


Thanks for the thoughtful comment! Here are a few things I wanted to add:

  1. In my experiment, data quality appears fairly consistent between replicates: every sample has between 2-4% of reads in peaks. These percentages are probably on the low side because I called peaks very stringently (intersecting across all 3 replicates produced ~500 consensus peaks) and total library sizes are fairly large (~15 million reads per sample).
  2. I believe that there IS a highly asymmetric change in differential binding. Hence, I believe a normalization method that produces points centered/balanced around log2FC = 0 would not be appropriate for my experiment. This is why I would like to use spike-ins to normalize my data, though it is unclear to me whether I should normalize according to spike-in library sizes or to size factors from RLE on the spike-in reads. In particular, I'm unsure how to evaluate the validity of RLE's assumptions (balanced differential expression, up/downregulated features behaving the same) in a spike-in setting.
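To illustrate my worry in point 2, here is a toy Python sketch (invented counts) of what happens when the change is strongly asymmetric and affects the majority of peaks: RLE factors computed on the experimental peaks themselves absorb the real signal, and the truly unchanged peaks come out looking differential:

```python
import math
from statistics import median

# Toy two-sample peak counts with a strongly asymmetric change: 6 of 10
# peaks are truly 4x up in sample 2, at equal sequencing depth. All values
# are invented.
base = [500, 300, 200, 150, 120, 100, 80, 60, 50, 40]
s1 = base[:]
s2 = [b * (4 if i < 6 else 1) for i, b in enumerate(base)]

# Median-of-ratios (RLE) factors computed on the experimental peaks.
geo = [math.sqrt(x * y) for x, y in zip(s1, s2)]
f1 = median(x / g for x, g in zip(s1, geo))
f2 = median(y / g for y, g in zip(s2, geo))

# After normalization, the truly UNCHANGED peaks appear ~4x "down": the
# median ratio sits inside the changed majority, so RLE absorbs the signal.
m_unchanged = [
    math.log2((s2[i] / f2) / (s1[i] / f1)) for i in range(6, 10)
]
```

Here the six truly 4x-up peaks come out at M = 0 and the four unchanged peaks at M = -2, which is exactly the failure mode of the "most regions do not change" assumption.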

I'll also add that the lists of differentially bound sites from spike-in library size and spike-in RLE normalization are highly concordant (Jaccard index = 0.87), but the results of individual contrasts and the tag count profiles are quite different between the two methods.
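For reference, the Jaccard index quoted above is simply the ratio of shared sites to total sites across the two lists; a minimal sketch with hypothetical peak IDs:

```python
# Hypothetical differentially bound site sets from the two normalizations.
sites_libsize = {"peak1", "peak2", "peak3", "peak4"}
sites_rle = {"peak1", "peak2", "peak3", "peak5"}

# Jaccard index = |intersection| / |union| of the two site lists.
jaccard = len(sites_libsize & sites_rle) / len(sites_libsize | sites_rle)
# 3 shared / 5 total = 0.6 for this toy example
```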

DiffBind suggests (in the absence of spike-in or parallel factor controls) normalizing using RLE on large background bins, where any local enrichment is diluted out. When I perform background normalization, all the contrasts show non-zero differentially bound site calls (more in line with what I would expect). The tag count profiles for the top 10 genes in each contrast are shown below. Might background normalization be a better strategy in this case?

[Figure 1: tag count profiles for the top 10 genes in each contrast under background normalization]
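For intuition, here is a toy Python sketch of the background idea as I understand it (invented counts; DiffBind's actual implementation counts reads in large genomic bins, 15 kb by default as I understand it): RLE factors are computed on the bin counts, where peak enrichment is diluted out, and those factors are then applied to the peak-level counts:

```python
import math
from statistics import median

# Invented background-bin counts: rows = large genomic bins, columns = 2
# samples. Sample 2 has roughly 2x the depth of sample 1.
bin_counts = [
    [1000, 2100],
    [ 800, 1550],
    [1200, 2500],
    [ 900, 1790],
]

# RLE (median-of-ratios) factors computed on the BIN counts.
geo = [math.sqrt(a * b) for a, b in bin_counts]
f = [
    median(row[j] / g for row, g in zip(bin_counts, geo))
    for j in range(2)
]

# The background-derived factors are then applied to the PEAK counts.
peak_counts = [[150, 600], [90, 360]]   # invented peak-level counts
norm_peaks = [[c / f[j] for j, c in enumerate(row)] for row in peak_counts]
```

With these numbers the bin-derived factors recover the ~2x depth difference, so a peak with a 4x raw count ratio retains a ~2x enrichment change after normalization.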


I would really inspect the different methods using MA-plots. I personally have never been an advocate of the background binning strategy, since you are then essentially normalizing on the noise outside of peaks, which I find a little odd. As said, try different methods, inspect the MA-plots, see how each method centers the far-right part of the plot around y = 0, and then decide on the method that, given your knowledge of the experiment and your expectations towards it, fits best.


Many thanks! I appreciate your feedback!
