DESeq2 normalization - Found Difference in values when calculated manually
1
0
Entering edit mode
2.0 years ago
Sandhiya ▴ 20

Hi All,

I learnt about how DESeq2 normalises the raw read counts in statquest. It was very helpful and I tried to repeat the steps for my sample data.

But I am getting the output with minor difference - like for each sample outpu I get some difference .

Example: The above is the DESeq2 normalised count which I retrieved using count() in DESeq2 library.

When I manually calculate the normalization I get the following output : So there is difference . I'm confused why i get this different values ?

DESeq2 normalization difference • 2.4k views
0
Entering edit mode

What is the code you're using to maually calculate the values?

0
Entering edit mode

For manual calculation I have not used code. I tried to do it following the steps:

step 1:take log of all values - ie.,read count Step 2: Average each row Step 3: filter zero and infinite values - not applicable for our data Step : 4 Subtract the averae log value from log(counts) Step5 : calculate median for each sample Step 6: convert medians to normal numbers to get the final scoring factors for each sample Step 7 : Divide original read counts by scaling factors:

0
Entering edit mode

This is the exact code you can use to double-check: https://github.com/mikelove/DESeq2/blob/master/R/core.R#L534-L577

0
Entering edit mode

Thank you for the code . I have exactly followed the same steps. But I get a different output when I calculate manually.

0
Entering edit mode

Do you take the natural log rather than log10 or log2?

0
Entering edit mode

I used log10 and checked output - it was different; log2 result is also different for deseq output..

3
Entering edit mode
2.0 years ago
ATpoint 76k

I cannot see your code or your notes that you write down but here is how it goes in R using only base functions, following what the StatQuest video does with some dummy data. Results are identical to DESeq2, because StatQuest is awesome :)

data <- data.frame(Sample1=c(1000, 1098, 2234), Sample2=c(987, 778, 888), Sample3=c(999,888,0),
row.names = c("gene1", "gene2", "gene3"))

#/ Step1: take the log of all values (natural log!)
step1 <- log(data)

#/ Step2: take the mean:
step2 <- apply(step1, 1, mean)

#/ Step3: remove infinities:
step3 <- step2[is.finite(step2)]

#/ Step4: subtract the average log value from the log-transformed raw counts:
step4 <- step1[names(step3),] - step2[names(step3)]

#/ Step5: calculate the median for each sample (so column-wise)
step5 <- apply(step4, 2, median)

#/ Step6: put back to normal scale:
step6 <- exp(step5)

data.frame(manually=step6,
DESeq2=as.numeric(DESeq2::estimateSizeFactorsForMatrix(data)))

manually    DESeq2
1 1.0998199 1.0998199
2 0.9197479 0.9197479
3 0.9885750 0.9885750


Edit: Here is a spreadsheet with an Excel implementation: https://uni-muenster.sciebo.de/s/xQHwwyVrG9NFqst

1
Entering edit mode

BAM ! #statquest

0
Entering edit mode

Thank you. I much appreciate your solution and Statquest - of course, I have learnt many concepts from this channel.

But for the example which you have discussed here, I'm not getting the values that you have mentioned.

I request you to correct me with my steps: 0
Entering edit mode

I can confirm ATpoint 's calculations on my machine. Not sure what is going on but i think it might be during your scaling step

0
Entering edit mode

I'm still figuring out "why there is difference"

1
Entering edit mode

when i am having trouble replicating results, I normally start by eliminating any and all possible differences of which I am aware.

This is a personal thing, but I would start in R, since that's how DESeq2 is implemented...

hope it helps!

0
Entering edit mode

Thank you! You mean to say - just focus on the normalized count values obtained from deseq2 output. Right?

0
Entering edit mode

You can implement it in any language you want, even Excel...the steps are not difficult but I wonder why you do not get the results you want. Be sure that the order of operations is correct, e.g. in step4 it is logged counts minus the means of step1 and not vice versa.

0
Entering edit mode

Thank you ATpoint . I checked your excel - you have implemented all steps but in your excel Step 7 : Divide original read counts by scaling factorsis missing. This step 7 is the final one which gives the normalized counts - as per the Statquest video . You also check the step 7 - final result is not the same when you compare with the one from deseq2 normalized count.