Normalisation of paired samples in edgeR
2.0 years ago

I am performing differential expression of 10 paired samples (cancer and normal tissue) in edgeR and I'm following '3.4.1 Paired samples' in the Bioconductor User's Guide.

Do the library sizes need to be normalised prior to testing for a treatment effect?

Normalised with:

y <- calcNormFactors(y)


Estimating dispersion, fitting a generalized linear model and testing for the treatment effect:

y <- estimateDisp(y,design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit)
topTags(qlf)


I don't get any differentially expressed genes after I normalise, but if I omit normalisation I get differentially expressed genes.

RNA-Seq edgeR • 865 views

Normalization is independent of the experimental design, and yes, it needs to be performed. Whatever results you get without normalization are not meaningful. You might also incorporate the filterByExpr filter, as recommended in the manual.
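A minimal sketch of the filter-then-normalise order, assuming edgeR is installed. The counts are simulated, and the sample names, group labels and patient labels are placeholders rather than the poster's data.

```r
library(edgeR)

set.seed(1)
n_genes <- 1000

# Simulated raw counts: 3 patients, each with a control and a cancer sample
counts <- matrix(rnbinom(n_genes * 6, mu = 50, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("a_", 1:6)))
group    <- factor(rep(c("control", "cancer"), each = 3),
                   levels = c("control", "cancer"))
subjects <- factor(rep(paste0("patient", 1:3), times = 2))

y      <- DGEList(counts = counts, group = group)
design <- model.matrix(~ subjects + group)

# Drop lowly expressed genes, then recompute the library sizes
keep <- filterByExpr(y, design)
y    <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalisation factors (stored in y$samples$norm.factors)
y <- calcNormFactors(y)
y$samples
```

Filtering before calcNormFactors keeps genes with almost no counts out of the normalisation step.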

Do you start from raw counts, as edgeR expects? What do your design and groups look like? Did you check for batch effects using PCA on the logCPMs? A plot I find most useful is the MA-plot, i.e. logCPM on the x-axis and logFC on the y-axis. It shows both whether normalization is proper (most points should center along y = 0) and how the fold changes behave, i.e. whether there are simply no large FCs or whether the large FCs are simply not significant. In the latter case the volcano plot is another useful plot for exploring the results.
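The MA-plot described above could be drawn roughly like this. It is only a sketch on simulated counts with no true differences. The design `~ subjects + group` follows the paired layout of the User's Guide, and the coefficient name `groupcancer` results from the factor levels chosen here, so it is an assumption about how your own factors are coded.

```r
library(edgeR)

set.seed(42)
n_genes <- 1000
counts <- matrix(rnbinom(n_genes * 6, mu = 50, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("a_", 1:6)))
group    <- factor(rep(c("control", "cancer"), each = 3),
                   levels = c("control", "cancer"))
subjects <- factor(rep(paste0("patient", 1:3), times = 2))
design   <- model.matrix(~ subjects + group)

y   <- DGEList(counts = counts, group = group)
y   <- y[filterByExpr(y, design), , keep.lib.sizes = FALSE]
y   <- calcNormFactors(y)
y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = "groupcancer")

# Full results table: one row per gene with logFC, logCPM, F, PValue, FDR
tt <- topTags(qlf, n = Inf)$table

# MA-plot: average abundance on x, fold change on y
plot(tt$logCPM, tt$logFC, pch = 16, cex = 0.5,
     xlab = "Average logCPM", ylab = "logFC (cancer vs control)")
abline(h = 0, col = "red")
```

If the bulk of the points sits clearly above or below the red line, the normalisation deserves a second look.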


Great, thank you. I'll include normalisation.

I have a filtering section:

keep <- rowSums(cpm(y)>0.5) >=2


My samples look like this:

     files  group    lib.size  norm.factors  subjects
a_1  1.csv  control  16065685  1.8069450     patient1
a_2  2.csv  control   4740572  2.1098124     patient2
a_3  3.csv  control  19853317  1.8273974     patient3
a_4  4.csv  cancer   22955672  0.8591707     patient1
a_5  5.csv  cancer   38906433  0.6714201     patient2
a_6  6.csv  cancer   21069541  1.2216965     patient


My design looks like this:

   design <- model.matrix(~0+subjects+group)
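One thing worth checking with this design is which column glmQLFTest tests by default (the last one) and in which direction. A small base-R sketch, assuming the factors are built from the labels in the table above with R's default alphabetical level order:

```r
subjects <- factor(rep(paste0("patient", 1:3), times = 2))
group    <- factor(rep(c("control", "cancer"), each = 3))

# Default levels are alphabetical, so "cancer" becomes the reference level
levels(group)

design <- model.matrix(~0 + subjects + group)
colnames(design)
# The last column is "groupcontrol": logFC is then control relative to cancer

# Making control the reference gives cancer-vs-control fold changes instead
group2  <- relevel(group, ref = "control")
design2 <- model.matrix(~0 + subjects + group2)
colnames(design2)
```

Either coding gives the same p-values; only the sign of the logFC flips.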


I haven't checked for batch effects, I will give it a go.


Look at the lib.size column: the cancers are sequenced much more deeply than the controls, which is one of the reasons why normalization is necessary. The counts in cancer are probably much higher simply because of that, and you have to correct for it; details here. Try to do the PCA first, e.g. using the PCAtools package from Bioconductor, or simply use the plotMDS function from edgeR/limma, which implements a very similar technique. This will also tell you how well the samples cluster together, which can be a proxy for the dispersion between replicates. Since you start from csv files, may I ask how you obtained the counts?
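A rough sketch of both checks, assuming edgeR is installed. The counts are simulated and the relative depths in `depth` are made up to mimic unequal sequencing depth; they are not the values from the thread.

```r
library(edgeR)

set.seed(7)
n_genes <- 1000
depth   <- c(1, 0.3, 1.2, 1.5, 2.5, 1.3)  # hypothetical relative depths

# The matrix is filled column by column, so each sample gets its own depth
mu     <- rep(depth, each = n_genes) * 50
counts <- matrix(rnbinom(n_genes * 6, mu = mu, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("a_", 1:6)))
group <- factor(rep(c("control", "cancer"), each = 3),
                levels = c("control", "cancer"))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Effective library size = raw library size x normalisation factor
eff_lib <- y$samples$lib.size * y$samples$norm.factors
round(eff_lib)

# MDS on logCPM (plotMDS computes logCPM from the DGEList internally)
mds <- plotMDS(y, col = ifelse(group == "cancer", "red", "blue"))
```

Samples that differ only in depth should not separate on this plot once library sizes are taken into account.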


I did PCA using plotMDS on the raw data and on the normalised data. I have 'Leading logFC dim2' along the y-axis and 'Leading logFC dim1' along the x-axis. For the raw data the samples cluster along y=0, and after normalisation the samples are more dispersed.

I'm working with circRNAs and I used two detection methods and merged on commonly detected circRNAs using a script which output to .csv files.

[MDS plot: Raw]

[MDS plot: Normalised]


PCA should be done on the log2-transformed normalized data. I am currently putting together a little tutorial on basic QC for DNA/RNA-seq, including PCA and MA-plots, that covers the basics with example code; it will probably go online early next week. Maybe this clarifies some things.
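In the meantime, PCA on log2-transformed normalized data might look like the sketch below. It uses simulated counts and base R's prcomp; the prior count of 2 is just a common choice, not something prescribed in this thread.

```r
library(edgeR)

set.seed(3)
n_genes <- 1000
counts <- matrix(rnbinom(n_genes * 6, mu = 50, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("a_", 1:6)))
group <- factor(rep(c("control", "cancer"), each = 3),
                levels = c("control", "cancer"))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# log2 counts-per-million, using the normalised (effective) library sizes
logcpm <- cpm(y, log = TRUE, prior.count = 2)

# prcomp expects samples in rows, so transpose the gene x sample matrix
pca <- prcomp(t(logcpm))
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

plot(pca$x[, 1], pca$x[, 2], pch = 16,
     col  = ifelse(group == "cancer", "red", "blue"),
     xlab = paste0("PC1 (", pct[1], "%)"),
     ylab = paste0("PC2 (", pct[2], "%)"))
text(pca$x[, 1], pca$x[, 2], labels = colnames(logcpm), pos = 3)
```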


Awesome, that'll be of great help, where will the tutorial be available?


Will post it here on biostars.


Also, was it correct for me to do the PCA on the normalised data?


Great, thank you. Just saw this. I'll give this a go.