Question: Normalizing RNA-Seq Data Using ERCC Spike-Ins
Eric Fournier (Quebec, Canada) wrote 4.7 years ago:


I have sequencing data that was spiked with exogenous ERCC controls to allow normalization. I have aligned the reads using TopHat 2 and estimated their abundance using Cufflinks. To normalize transcript abundances and get a measure which is directly comparable between replicates, my strategy up to now has been to take a transcript's FPKM and divide it by the sum of the FPKMs of all exogenous ERCC controls.

This seems to work, but now I am left with a question: by dividing FPKMs by FPKMs, am I not cancelling out the part of the FPKM calculation which accounts for the number of reads? That is, FPKM is Fragments Per Kilobase of exon per Million mapped reads. The "kilobase of exon" term is a characteristic of the transcript, and for each ERCC transcript it is identical across all replicates, so all of my calculated values are scaled by the same constant and there's no issue there. However, the "per million reads" term is a per-replicate variable, and within a replicate it is identical for the biological transcript and the exogenous ones, so I assume it cancels out. Is that right? And if it is, is that desirable (since I am, after all, seeking to normalize on the amount of ERCC transcripts), or should I switch to normalizing by, for example, the total number of reads aligned to ERCC transcripts?
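As a rough illustration of the cancellation asked about here, the following Python sketch uses the textbook FPKM formula (fragments × 10^9 / (length in bp × total mapped fragments)) rather than Cufflinks' actual model-based estimate. The function names and all numbers are made up for the example.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


def ercc_normalized(fragments, length_bp, ercc, total_mapped):
    """Divide a transcript's FPKM by the summed FPKM of the spike-ins.

    ercc is a list of (fragments, length_bp) pairs, one per spike-in.
    """
    ercc_sum = sum(fpkm(f, l, total_mapped) for f, l in ercc)
    return fpkm(fragments, length_bp, total_mapped) / ercc_sum


# Hypothetical spike-in counts and lengths for one replicate.
ercc = [(500, 1000), (120, 2000), (30, 500)]

# Same counts, two different "total mapped" values: the ratio does not change,
# because the per-million term appears in both numerator and denominator.
a = ercc_normalized(800, 1600, ercc, total_mapped=20_000_000)
b = ercc_normalized(800, 1600, ercc, total_mapped=35_000_000)
print(a, b)
```

Under this simplified formula the result reduces to (fragments / length) for the transcript divided by the sum of (fragments / length) over the spike-ins, so only the length scaling survives.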

normalization fpkm rna-seq • 12k views
modified 4.7 years ago by Devon Ryan79k • written 4.7 years ago by Eric Fournier1.4k

How will you use cuffdiff to evaluate significant changes in expression with your normalized FPKM files? I have so far only been successful using cuffdiff with .bam or .sam file input. Is there a way to give cuffdiff an input of spike-in normalized FPKM files?

written 4.6 years ago by jep0

Does anyone have an opinion or answer to this question?

written 4.5 years ago by jep0

Devon Ryan (Freiburg, Germany) wrote 4.7 years ago:

Remember that the "per million reads" part of FPKM is actually a library size normalization step. Undoing it by dividing by the spike-ins is fine, since what you end up with is a transcript-length-normalized count that has then been normalized by your spike-ins. If you're using cuffdiff next, be sure to change the default library size normalization so it doesn't undo all your hard work!

I'm curious how well this works with cuffdiff (I assume that's what you're using) and how the results compare to the same data in DESeq, where I find dealing with spike-ins more straightforward.

modified 4.7 years ago • written 4.7 years ago by Devon Ryan79k

dpryan79 -- Can you provide more details for how you normalize for spike-ins with DESeq?

written 4.6 years ago by jep0

Read in the count data, subset the resulting matrix so that it includes only the spike-ins, create a DESeqDataSet from that, and then just run estimateSizeFactors() on the result. The size factors can then be placed in the appropriate slot of the DESeqDataSet for the full count matrix (make sure to remove the spike-ins from it, since you no longer need them).

Edit: The same procedure would work for edgeR or limma as well. This is also part of the modification that SAMstrt makes to SAMseq, if you're interested in just using that.
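The steps above can be sketched outside of R as well. This is a plain-Python re-implementation of the median-of-ratios idea behind size factor estimation, not DESeq's own code; the count table, the `spike_names` set, and the function name `size_factors` are hypothetical and only there for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column).

    counts maps a row name to a list of per-sample raw counts.
    Rows containing a zero are skipped, since their geometric mean is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical raw counts for two samples; the spike-in rows are known ahead of time.
counts = {
    "ERCC-00002": [100, 200],
    "ERCC-00004": [50, 100],
    "ERCC-00009": [10, 20],
    "geneA": [30, 30],
    "geneB": [400, 1600],
}
spike_names = {"ERCC-00002", "ERCC-00004", "ERCC-00009"}

# 1. Estimate size factors from the spike-in rows only.
spikes = {name: row for name, row in counts.items() if name in spike_names}
sf = size_factors(spikes)

# 2. Drop the spike-ins and apply those size factors to the remaining rows.
normalized = {
    name: [c / s for c, s in zip(row, sf)]
    for name, row in counts.items()
    if name not in spike_names
}
print(sf, normalized)
```

In this toy table every spike-in doubles between the two samples, so the second size factor is twice the first, and geneA (same raw count in both samples) comes out as halved relative to the spike-ins.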

modified 4.6 years ago • written 4.6 years ago by Devon Ryan79k

Thank you. I just now found this response and am trying it today. I have not found a way to use cuffdiff with anything other than .bam or .sam file input, which keeps me from comparing how well cuffdiff works with normalized FPKM input against DESeq and edgeR. I agree with your comment above that this would be an important comparison. Best,

written 4.5 years ago by jep0

There's likely some hacking of the source needed, since cuffdiff tries to re-estimate FPKMs given a merged annotation. You'd just need to get around that step, and then the remainder should work. Personally, I simply wouldn't use cuffdiff for this sort of task.

written 4.5 years ago by Devon Ryan79k

Hi Devon

(1) Could you post the R commands for all the steps you mentioned (how you normalize for spike-ins with DESeq/DESeq2)? For example, you said to "subset the resulting matrix such that it includes only the spike-ins", but I don't know how to select only the spike-in rows.

(2) If you do something similar in edgeR, do we still need to use CPM to filter the count table?


Thank you so much!

written 2.6 years ago by super40

(1) You'd have to either know ahead of time which rows have the spike-ins or know the names that they go by.

(2) "Need" is a bit strong, but you'll probably benefit from doing so (simply for the sake of statistical power).
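Both points can be sketched in a few lines of Python. The sketch assumes the spike-in rows carry the standard ERCC identifiers, which begin with "ERCC-" (check this against your own annotation), and uses a CPM cutoff of 1 in at least 2 samples purely as an example threshold. The function names and the count table are hypothetical.

```python
def split_spike_ins(counts, prefix="ERCC-"):
    """Separate spike-in rows from endogenous rows by row-name prefix."""
    spikes = {n: r for n, r in counts.items() if n.startswith(prefix)}
    genes = {n: r for n, r in counts.items() if not n.startswith(prefix)}
    return spikes, genes


def cpm_filter(counts, min_cpm=1.0, min_samples=2):
    """Keep rows whose counts-per-million reach min_cpm in at least min_samples samples."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for name, row in counts.items():
        n_pass = sum(1 for c, lib in zip(row, lib_sizes) if c * 1e6 / lib >= min_cpm)
        if n_pass >= min_samples:
            kept[name] = row
    return kept


# Hypothetical raw counts for two samples.
counts = {
    "ERCC-00002": [900, 1100],
    "geneA": [999_000, 1_998_000],
    "geneB": [1, 0],
    "geneC": [99, 900],
}

spikes, genes = split_spike_ins(counts)
kept = cpm_filter(genes)
print(sorted(spikes), sorted(kept))
```

Here geneB passes the cutoff in only one sample and is dropped, which is the kind of low-count row whose removal helps statistical power.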

written 2.6 years ago by Devon Ryan79k

I haven't been using cuffdiff for differential expression testing, as it has been crashing when I try to feed it 30 multi-GB BAM files at a time.

written 4.5 years ago by Eric Fournier1.4k

Generally we've stopped using the Tuxedo suite altogether for similar reasons.

written 3.9 years ago by earonesty200

Jonathanjacobs (Rockville, MD) wrote 4.7 years ago:

If you are going to use the ERCC data, then (IMHO) you should apply loess normalization to your raw data, with the ERCC data factored in.

See here for a related question: question in normalizing with ERCC spike-in control

written 4.7 years ago by Jonathanjacobs150

I've tried using loess for normalization, but the results have been pretty awful. I cannot say with certainty why, but I think it might be because, in my data, the correlation between ERCC concentration and FPKM breaks down at lower concentrations, with FPKMs that sometimes vary by a hundredfold for the same concentration.
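One simple way to quantify that kind of scatter before fitting anything is to group the spike-in measurements by known concentration and look at the max/min FPKM ratio within each group. The Python sketch below does this; the function name and the (concentration, FPKM) pairs are invented for illustration and are not data from this thread.

```python
from collections import defaultdict


def fold_spread_by_concentration(spikes):
    """Return {concentration: max FPKM / min FPKM} for groups with 2+ nonzero values.

    spikes is an iterable of (concentration, fpkm) pairs.
    """
    groups = defaultdict(list)
    for conc, value in spikes:
        if value > 0:  # zero FPKMs would make the ratio undefined
            groups[conc].append(value)
    return {c: max(v) / min(v) for c, v in sorted(groups.items()) if len(v) > 1}


# Hypothetical measurements: noisy at low concentration, tight at high concentration.
observations = [
    (0.01, 0.02), (0.01, 2.0),
    (1.0, 9.0), (1.0, 12.0),
    (100.0, 950.0), (100.0, 1100.0),
]
spread = fold_spread_by_concentration(observations)
for conc, fold in spread.items():
    print(conc, round(fold, 2))
```

If the low-concentration groups show a spread far larger than the high-concentration ones, restricting a fit to the well-behaved range is one option worth trying.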

written 4.5 years ago by Eric Fournier1.4k