Question

RPKM normalization - small RNA sequencing

0

7.6 years ago

Pål Nilsen • 0

Asking for a second opinion on my reasoning for not using RPKM to normalize between my samples. Any feedback would be highly appreciated.

Reasoning for not normalizing for transcript length and total mapped reads per sample (RPKM): I am not analyzing differential gene expression within samples. I compare reads mapped to the same target (e.g. a virus genome sequence or virus gene sequence) between samples. Thus I do not have to normalize for transcript length, since the target has the same length in every sample.

Regarding “total mapped reads per sample”, I expect these to differ between samples due to inherent characteristics of the samples. For example, I would expect a sample taken from a highly infected individual to contain more siRNA-specific reads than a sample from an uninfected individual. Thus, since I am measuring differences between treatment groups, I do not want to normalize away this variation in total mapped reads.

NB! An alternative option would be to normalize to the overall total reads collected per sample in the sequencing procedure, in case some samples were sequenced more deeply.
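To make the length argument concrete, here is a minimal sketch of the standard RPKM formula (reads per kilobase of target per million mapped reads). The read counts, library sizes and the 10 kb target length are made-up numbers for illustration only. It shows that when the target is the same in both samples, the length term cancels in the between-sample ratio, so only the "per million mapped reads" part affects the comparison.

```python
import math

def rpkm(reads_on_target, target_length_bp, total_mapped_reads):
    """Reads per kilobase of target per million mapped reads."""
    return reads_on_target * 1e9 / (target_length_bp * total_mapped_reads)

# Hypothetical numbers: the same 10 kb viral genome in two samples.
length_bp = 10_000
infected = rpkm(5_000, length_bp, 2_000_000)
control = rpkm(50, length_bp, 1_000_000)

# The same comparison with the length term dropped (reads per million only).
rpm_ratio = (5_000 / 2_000_000) / (50 / 1_000_000)

# The between-sample ratio is identical with or without the length term.
print(math.isclose(infected / control, rpm_ratio))
```

Whether the remaining per-million scaling is appropriate is a separate question from the length correction.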

RNA-Seq • 2.8k views

ADD COMMENT • link updated 7.6 years ago by Carlo Yague 8.7k • written 7.6 years ago by Pål Nilsen • 0

score 1 · Answer 1 · 2016-09-20

1

7.6 years ago

Carlo Yague 8.7k

Yes, RPKM is probably not the best option in your case. This blog post (I'm not the author) explains nicely why RPKM is ill-suited to differential expression analysis.

However, your alternative idea (normalizing to the total number of reads sequenced) is not so good IMO. A lot of things can go wrong during or before library preparation and change the ratio of gene-derived reads to total reads sequenced, for instance contamination, degradation, ribodepletion, ...

The best way to normalize such a dataset would be to add spike-ins before library preparation. But if it's too late for that, you should probably use a count-based method such as DESeq or edgeR, whose normalization is robust to massive changes in a few features (siRNA in your case).
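As a rough illustration of why a massive change in a few features is a problem for simple per-million scaling, consider the toy example below. The feature names and counts are hypothetical: three host small RNAs that do not change, plus a viral siRNA class that goes from almost absent to very abundant.

```python
# Hypothetical counts: three unchanged host small RNAs plus viral siRNA.
control = {"miR_a": 1000, "miR_b": 2000, "miR_c": 3000, "viral_siRNA": 10}
infected = {"miR_a": 1000, "miR_b": 2000, "miR_c": 3000, "viral_siRNA": 6000}

def per_million(counts):
    """Scale each count to reads per million of the sample total."""
    total = sum(counts.values())
    return {name: n * 1e6 / total for name, n in counts.items()}

rpm_control = per_million(control)
rpm_infected = per_million(infected)

# Fold change (infected / control) after per-million scaling.
fold_change = {name: rpm_infected[name] / rpm_control[name] for name in control}
for name, fc in fold_change.items():
    print(name, round(fc, 2))
```

In this toy case the host small RNAs have identical raw counts in both samples, yet after per-million scaling they appear roughly halved, because the viral siRNA reads take up half of the infected library.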

ADD COMMENT • link 7.6 years ago by Carlo Yague 8.7k

0

Thanks for the feedback. Your opinion on normalizing to total reads sounds reasonable to me. Being a newbie in the field, I guess I am not aware of the various pitfalls in RNA-seq. In retrospect, a spike-in would have been a good idea and well suited to detecting technical variation between samples in the library prep and sequencing procedure.

My results stem from 10 RNA samples (pools of 5 biological replicates) sampled from different treatment groups at various time points. I know I should have used technical replicates in the library prep and sequencing procedure, but cost was an issue here. However, the results I have gathered so far paint a pretty picture, being very consistent with what I expect and in accordance with several other types of analyses. Thus I am not sure DESeq will help. Wouldn't this method be more useful when applied to check for variance between technical replicates?

ADD REPLY • link 7.6 years ago by Pål Nilsen • 0

0

I see. The DESeq workflow can be split into two steps:

1. Read count normalization, where you normalize based on the distribution of read counts per gene. This kind of normalization is quite robust to the technical issues that can mess with the total read count.
2. Calling differentially expressed genes (gives you p-values, etc.).

Step 2 works best with replicates (although it works without replicates*). However, the normalization step is independent of replicates, so you can still normalize your data using DESeq and then use the normalized counts for the next steps of your analysis.

* See the DESeq2 manual section 5.8:

If a DESeqDataSet is provided with an experimental design without replicates, a warning is printed, that the samples are treated as replicates for estimation of dispersion. This kind of analysis is only useful for exploring the data, but will not provide the kind of proper statistical inference on differences between groups
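For intuition about the normalization step (step 1), here is a minimal pure-Python sketch of the median-of-ratios idea that DESeq-style size factors are based on. It is not the DESeq2 code itself, and the count table is hypothetical: sample 2 is sequenced twice as deep as sample 1 and also carries a large excess of one feature (think viral siRNA).

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (features), each a list of per-sample raw counts.
    """
    n_samples = len(counts[0])
    # Keep only features with a non-zero count in every sample.
    rows = [row for row in counts if all(c > 0 for c in row)]
    factors = []
    for j in range(n_samples):
        log_ratios = []
        for row in rows:
            # Log geometric mean of the feature across samples (reference).
            log_geo_mean = sum(math.log(c) for c in row) / n_samples
            log_ratios.append(math.log(row[j]) - log_geo_mean)
        # The median ignores the few features with extreme changes.
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical table: rows are features, columns are samples 1 and 2.
counts = [
    [1000, 2000],
    [2000, 4000],
    [3000, 6000],
    [10, 6000],  # feature with a massive change
]
sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

In this toy table the size factors differ by a factor of two, matching the sequencing depth difference, and the three unchanged features end up with equal normalized counts in both samples while the fourth keeps its large difference. With real data you would use the package's own functions rather than a hand-rolled version like this.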

ADD REPLY • link 7.6 years ago by Carlo Yague 8.7k