RNA-seq normalization: How to use TMM and rpkm() in EdgeR
2
8
Entering edit mode
8.1 years ago
AW ▴ 350

Hi,

I have some RNA-seq samples that I want to normalize and then output RPKM expression, but I am unsure how to do this.

This is my pipeline so far:

Normalise raw read counts with TMM in edgeR

expr <- DGEList(counts=data, group=conditions)
expr <- calcNormFactors(expr)


output:

\$samples
group lib.size norm.factors
Sample1     F 19770521    1.0462660
Sample2     F 17970679    0.8794805
Sample3     F 19184265    1.0573665


QUESTION: How do I get normalized raw read counts from this? Do I multiply the read counts by the norm.factors?

QUESTION: Ultimately, I want to end up with RPKM values for each gene in each sample. I know I can use the rpkm() function below in edgeR

expr_norm <- rpkm(expr, log=FALSE,gene.length=vector)


but is expr the output from calcNormFactors or something else?

A

normalization RNA-Seq RPKM TMM EdgeR • 25k views
4
Entering edit mode
8.1 years ago

The library size normalized counts are made by dividing the counts by the normalization factor (you'll note that the larger libraries have larger normalization factors, so if you multiplied things you'd just inflate the difference in sequencing depth).

For the rpkms, just do rpkm(expr, gene.length=vector), since it can take your DGEList, (this is explained in help(rpkm)).

0
Entering edit mode

great thanks very much!!

so just to clarify if I do rpkm(expr) then the output will be rpkm files that are both TMM normalised and also length normalised? Is this because it is reading in the norm.factors and the unnormalised read counts from DGEList?

thanks!

0
Entering edit mode

Oops, I forgot the "gene.length" part. I've updated my answer accordingly. Regardless, yes, rpkm can get the normalization factors from the DGEList, so you needn't much with that yourself. If you input counts as a matrix then it'll just do library size normalization, which is less robust.

0
Entering edit mode

Thanks! And when you say as a matrix, you mean if I divided the counts by the normalisation factor manually and then gave this as an input to rpkm()?

0
Entering edit mode

Or just fed it in as is. Feeding in the normalized counts would probably lead to weird results, since the M in RPKM is there for normalization (so you would have normalized a normalized result...which will likely muck it up slightly).

0
Entering edit mode

Cool, thanks! So I will use the DGEList output from calcNormFactors as the input for rpkm along with gene length information!

0
Entering edit mode

I also have two other questions, that Id be very grateful if you could answer.

Q1. When creating the expr <- DGEList(counts=data, group=conditions), what effect does specifying groups have one the TMM normalisation? How does TMM use this information and how would the results differ if you did specify groups versus not?

Q2. The expression data I am using was obtained from mapping reads onto denovo contigs assembled with Trinity. I then chose the most highly expressed contig from each cluster as the "best isoform" and then summed expression across all the contigs in the cluster as the expression value for that cluster. Therefore I do not have one obvious gene length to use. Should I use the longest contig from the cluster?

Thanks!