ChIP-seq visualization: Is it valid to do a coverage normalization in addition to applying a spike-in-derived scaling factor?
Question (kalavattam):

Let's say a ChIP-seq sample is scaled in the following way:

  1. Calculate the percentage of spike-in reads in the immunoprecipitate (IP).
  2. Calculate the percentage of spike-in reads in the input.
  3. Calculate the scaling factor by dividing the input spike-in percentage by the IP spike-in percentage, i.e., scaling factor = input spike-in % ÷ IP spike-in %.
  4. Normalize the scaling factors of the samples under comparison by dividing each by the highest or lowest scaling factor, depending on the downstream application (a sketch of steps 1-4 follows the code below).
  5. Apply the scaling factor using, for example, bamCoverage:

    bamCoverage \
        -b "${bam}" \
        --binSize "${bin_size}" \
        --scaleFactor ${scaling_factor} \
        -o "${bigwig}"
    

Is it valid to do a coverage normalization in addition to the application of the scaling factor?

# Say, for example, RPKM (reads per kilobase per million mapped reads) normalization...
bamCoverage \
    -b "${bam}" \
    --binSize "${bin_size}" \
    --scaleFactor ${scaling_factor} \
    --normalizeUsing RPKM \
    -o "${bigwig}"

# Or perhaps BPM (bins per million) normalization (same as TPM in RNA-seq)...
bamCoverage \
    -b "${bam}" \
    --binSize "${bin_size}" \
    --scaleFactor ${scaling_factor} \
    --normalizeUsing BPM \
    -o "${bigwig}"
Tags: ChIP-seq • normalization • scaling-factor • coverage • spike-in
Answer (jared.andrews07):

That's not possible (or at least won't do what you intend). The source code for bamCoverage includes:

if args.normalizeUsing == 'None':
    args.normalizeUsing = None  # For the sake of sanity
elif args.normalizeUsing == 'RPGC' and not args.effectiveGenomeSize:
    sys.exit("RPGC normalization requires an --effectiveGenomeSize!\n")

if args.normalizeUsing:
    # if a normalization is required then compute the scale factors
    bam, mapped, unmapped, stats = openBam(args.bam, returnStats=True, nThreads=args.numberOfProcessors)
    bam.close()
    scale_factor = get_scale_factor(args, stats)
else:
    scale_factor = args.scaleFactor

So providing an argument to --normalizeUsing will overwrite your provided scale factors.

Reply (kalavattam):

jared.andrews07 — Thank you for your input and for taking the time to review the code.

Based on my analysis of the deepTools code, I believe that, when invoking deepTools' bamCoverage from the command line, any custom scaling factor supplied via --scaleFactor ${value} (where value is a user-provided positive integer or float) is recognized by the function get_scale_factor() (which is called by bamCoverage.py) and is thus included in any subsequent RPKM, RPGC, CPM, or BPM calculation. This is because all command-line argument values, including ${value} from --scaleFactor, are assigned to the object args in bamCoverage.py; args is then passed to get_scale_factor(), so args.scaleFactor is accessible to the code within that function.

Within getScaleFactor.py, we see

    elif args.normalizeUsing == 'RPKM':
        # Print output, since normalization stuff isn't printed to stderr otherwise
        sys.stderr.write("normalization: RPKM\n")

        # the RPKM is the # reads per tile / \
        #    ( total reads (in millions) * tile length in Kb)
        million_reads_mapped = float(bam_mapped) / 1e6
        tile_len_in_kb = float(args.binSize) / 1000

        scale_factor *= 1.0 / (million_reads_mapped * tile_len_in_kb)

        if debug:
            print("scale factor using RPKM is {0}".format(args.scaleFactor))

The scale_factor is multiplied by 1.0 / (million_reads_mapped * tile_len_in_kb), making a new "RPKM-transformed" scaling factor. Back in bamCoverage.py, this updated value for scale_factor is stored and made available to subsequent code. Ultimately, the scaling factor is applied to coverage/signal and written out to a bedGraph or a bigWig via the function writeBedGraph.scaleCoverage.
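
To make the arithmetic concrete, here is a minimal numeric sketch (all values made up) of what the excerpt above computes; note that the user-supplied --scaleFactor is multiplied into the RPKM term rather than overwritten:

# Hypothetical values for illustration only.
user_scale_factor = 0.73     # spike-in-derived value passed to --scaleFactor
bam_mapped = 25_000_000      # mapped reads in the BAM
bin_size = 50                # --binSize in bp

million_reads_mapped = bam_mapped / 1e6   # 25.0
tile_len_in_kb = bin_size / 1000          # 0.05

# As in getScaleFactor.py: scale_factor starts from args.scaleFactor
# and is then multiplied by the RPKM term.
scale_factor = user_scale_factor
scale_factor *= 1.0 / (million_reads_mapped * tile_len_in_kb)
print(scale_factor)  # 0.73 * 1 / (25.0 * 0.05) = 0.584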

Anyway, I hope I'm not coming off as pedantic with all of this. The reason I bring it all up is that I work with researchers who have historically called bamCoverage like this:

bamCoverage \
    -b "${bam}" \
    --binSize "${bin_size}" \
    --scaleFactor ${scaling_factor} \
    --normalizeUsing RPKM \
    -o "${bigwig}"

This seems weird to me, and I want to check my understanding with people who have knowledge of and experience with calculating and plotting "coverage" for ChIP-seq (and other NGS) assays.

The reason this seems weird to me is that, in our specific analyses, we have calculated scaling factors based on spike-in reads. We use spike-ins because we expect genome-wide shifts in coverage between our control and experimental "models"; these shifts would not really be apparent without spike-in normalization.

  1. Wouldn't applying an additional normalization to the spike-in normalization skew the resulting coverage, potentially altering the appearance of genome-wide shifts?
  2. For (rough) inter-sample comparisons, would it not be preferable to work with counts following the application of the scaling factor rather than the counts following both a scaling-factor normalization and RPKM normalization?

Also, if it is valid to apply an additional normalization, I recognize that B/TPM normalization should be favored over R/FPKM normalization since, per this and other papers,

The intended meaning of [R/FPKM] is a measure of relative RNA molar concentration (rmc) of a transcript in a sample. If a measure of RNA abundance is proportional to rmc, then their average over genes within a sample should be a constant, namely the inverse of the number of transcripts mapped. Unfortunately, [R/FPKM] does not respect this invariance property and thus cannot be an accurate measure of rmc (Wagner et al. 2012). In fact, the average [R/FPKM] varies from sample to sample. Therefore, [B/TPM] ([bins/transcripts] per million), a slight modification of [R/FPKM], was proposed (Li and Dewey 2011; Wagner et al. 2012).
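
To illustrate the invariance property described in the quote, here is a toy sketch (counts and feature lengths made up) showing that per-sample TPM values always sum to the same constant, whereas summed RPKM varies from sample to sample:

def rpkm(counts, lengths_bp):
    # reads per kilobase per million mapped reads
    total_millions = sum(counts) / 1e6
    return [c / (total_millions * (l / 1e3)) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    # length-normalize first, then scale each sample to one million
    rpk = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    per_million = sum(rpk) / 1e6
    return [r / per_million for r in rpk]

lengths = [500, 2_000, 10_000]
for counts in ([100, 400, 1_000], [300, 300, 300]):
    print(round(sum(rpkm(counts, lengths))))  # 333333, then 866667
    print(round(sum(tpm(counts, lengths))))   # 1000000 both times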

If you made it this far, thanks for reading it all.

Reply (Devon Ryan):

  1. Yes, an additional normalization would skew the results and is definitely not what you want, since you already have a spike-in normalization.
  2. Yes, that's why multiBamSummary can produce a DESeq2-equivalent scaling factor, which could be plugged in here (see the sketch below).
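
For reference, a minimal sketch of the DESeq2-style median-of-ratios size factor mentioned in point 2, assuming NumPy and a made-up bins-by-samples count matrix of the kind multiBamSummary summarizes:

import numpy as np

def size_factors(counts):
    """DESeq2-style median-of-ratios size factors; counts: bins x samples."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    # Per-bin geometric mean on the log scale; bins with any zero count
    # produce -inf and are dropped, as in DESeq2.
    log_geo_means = log_counts.mean(axis=1)
    keep = np.isfinite(log_geo_means)
    log_ratios = log_counts[keep] - log_geo_means[keep, None]
    return np.exp(np.median(log_ratios, axis=0))

counts = np.array([[10, 20], [100, 210], [50, 95], [0, 4]], dtype=float)
print(size_factors(counts))  # approx. [0.71, 1.41]

Note that a size factor of this kind is conventionally divided into the counts, so depending on the tool's convention you may need the reciprocal before passing a value to --scaleFactor; check the actual multiBamSummary output against your expectations.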
Reply (jared.andrews07):

Very good digging. As for your questions:

Wouldn't applying an additional normalization to the spike-in normalization skew the resulting coverage, potentially altering the appearance of genome-wide shifts?

I actually don't know. I may have to try this to see how it impacts things. Or if you do, I'd be super interested in the results.

For (rough) inter-sample comparisons, would it not be preferable to work with counts following the application of the scaling factor rather than the counts following both a scaling-factor normalization and RPKM normalization?

I would think so, and the DiffBind vignette (section 7.6) seems to support this.

I'll see if I can get the deeptools dev to weigh in; I'd like to hear his thoughts on this.

Reply (kalavattam):

Thank you. I agree that, for the purpose of illustration, it's a good idea to post some examples of coverage plots with (1) no normalization, (2) spike-in normalization, and (3) both spike-in and RPKM normalization. I'll do that soon.

I'm writing this after the post from Devon Ryan, so we now know that applying both spike-in and RPKM normalization is an invalid approach. To accompany that post, and for anyone who has this question and reads this thread in the future, it will be good to show why.
