Question: Only doing log2 transformation for RNAseq read count data without normalizing by DESeq2 ??!
1
gravatar for Raheleh
9 weeks ago by
Raheleh150
Raheleh150 wrote:

Hello all,

To normalize my read count data I used 2 different approaches:

1) normalized them with DESeq2 and then transformed them to log2 format.

2) I only transformed the read count data to log2 format (without normalizing using DESeq2).

I realized that the outputs are pretty much the same!! Can any one tell me why is that so? I am confused if the output is the same then why we use DESeq2 to normalize them? why not only do log2 transformation??

This is part of my read count data:

                              WT1    WT2    WT3 ACA1 ACA2 ACA3
ENSMUSG00000022857         61     27     54     733     238     332
ENSMUSG00000094478         1    321      0       0       2       0
ENSMUSG00000096410         1225   1319    648     126      32     119

1) I normalized them using DESeq2 and then transformed to log2:

my script:

cds <- DESeqDataSetFromMatrix(read.count[,-1], colData, design = ~group)
dds <- estimateSizeFactors(cds)
normalized_df <- counts(dds, normalized=TRUE)
normalized_df.log <- log2(normalized_df+1)

this is part of the output after normalizing by DESeq2 and transforming to log2:

                          WT1       WT2       WT3   ACA1  ACA2  ACA3
ENSMUSG00000022857  5.9533944  4.821842  5.792608  9.524640  7.902013  8.380811
ENSMUSG00000094478  0.9995925  8.345891  0.000000  0.000000  1.585730  0.000000
ENSMUSG00000096410 10.2589289 10.381332  9.353513  6.993656  5.045510  6.908315

2) This is the result after only doing log2 transformation (without normalizing using DESeq2):

                         WT1       WT2       WT3   ACA1  ACA2  ACA3
ENSMUSG00000022857  5.954196  4.807355  5.781360  9.519636  7.900867  8.379378
ENSMUSG00000094478  1.000000  8.330917  0.000000  0.000000  1.584963  0.000000
ENSMUSG00000096410 10.259743 10.366322  9.342075  6.988685  5.044394  6.906891

Many thanks!

log2 rna-seq readcount deseq2 • 312 views
ADD COMMENTlink modified 9 weeks ago by ATpoint35k • written 9 weeks ago by Raheleh150
1

DESeq2 normalizes for library depth. If your samples are all from the same library, normalization may not have a pronounced effect. Could that be the case here?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by RamRS27k

no they are from different library, because the sum of readcount for each sample is different.

sum(read.count$WT1)
[1] 13427116

sum(read.count$WT2)
[1] 12775113

sum(read.count$WT3)
[1] 13606000

sum(read.count$ACA1)
[1] 12969172

sum(read.count$ACA2)
[1] 16631477

sum(read.count$ACA3)
[1] 15890129
ADD REPLYlink modified 9 weeks ago by RamRS27k • written 9 weeks ago by Raheleh150

Its not cpm normalization where the count is divided to per million reads, it is more normalized with size factor, to convert non-normally distributed data to normally distributed data. As the data is normal distribution to begin with, may be the size factor is close to one! You can obtain the normalizing factor by dividing counts with normalized counts.

ADD REPLYlink written 9 weeks ago by piyushjo480

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLYlink written 9 weeks ago by RamRS27k

Why exactly you want to normalize counts by log2?

Can you also print output counts after normalization from DESeq2.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by piyushjo480

This is my data after normalization using DESeq2:

                           WT1          WT2          WT3     ACA1    ACA2    ACA3
ENSMUSG00000022857 6.096555e+01 2.728259e+01 5.443049e+01 7.355503e+02 2.381900e+02 3.323308e+02
ENSMUSG00000094478 9.994352e-01 3.243596e+02 0.000000e+00 0.000000e+00 2.001596e+00 0.000000e+00
ENSMUSG00000096410 1.224308e+03 1.332805e+03 6.531658e+02 1.264384e+02 3.202554e+01 1.191186e+02
ADD REPLYlink modified 9 weeks ago by RamRS27k • written 9 weeks ago by Raheleh150
1

As stated below, the counts after normalization looks similar to original counts due to normal distribution of the original data.

Also I normally prefer using vst normalized data for pca or any other processing (like WGCNA)

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by piyushjo480
5
gravatar for ATpoint
9 weeks ago by
ATpoint35k
Germany
ATpoint35k wrote:

The DESeq2 normalization in the context of RNA-seq aims to correct for differences in sequencing depth, taking library composition into account. Here is a wonderful video illustrating the process. Why not only log2? Because log2 does not remove differences in read depth which are always present in sequencing experiments, that is why count normalization is a standard step in the workflow. I cannot explain why in this specific case things look so similar, for this one would need access to the raw data and full code to make sure it is not an issue related to using flawed commands. Is this raw data you start from without modifications? Is this RNA-seq? If so then it is unlikely that the data are normally-distributed so something is probably wrong here. Please also show how extracted the unnormalized counts that you show for these three genes above.

Something else: I do not know where you all get the idea from that RNA-seq normalization will transform the non-normally distributed data to a normal distribution. That is simply wrong. RNA-seq data will follow roughly a negative binomial distribution before and after normalization. This is obvious because the normalization calculates a single scaling factor per library but does not perform any transformation. Therefore the overall distribution does not change. You will still have many genes with 0 or very low counts (the non-expressed genes) and few genes with very high expression plus all the genes in between these two extremes. This is not a normal distribution. The logarithm also does not change that. Tools such as DESeq2 even model counts as NB for that reason.

Here some plots using standard RNA-seq data from one of my experiments:

## take output of tximport and feed into DESeq2    
dds <- DESeqDataSetFromTximport(txi = txi, colData = colData, design = ~1)
dds <- estimateSizeFactors(dds)

## Lets look at one sample:
counts_raw <- counts(dds, normalized = FALSE)[,1]
counts_norm <- counts(dds, normalized = TRUE)[,1]
vsd <- assay(vst(dds))[,1]

## Plots
par(mfrow=c(2,2))
hist(log2(counts_raw+1), main="raw counts")
hist(log2(counts_norm+1), main="log2 norm. counts")
hist(assay(vst(dds))[,1], main="vst")

enter image description here

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by ATpoint35k

I have no idea what this sofware does exactly, but I do know the meaning of normalization. That can be achieved by calculating mean and StDev, via Box-Cox transformation, using the transformations you described, etc. The point is that OP's data didn't change much after normalization, which is why the end data is similar. And my suggestions stands regardless of what exact "normalization" was done: it is a good idea to normalize even if the modification ends up being minuscule compared to original data.

ADD REPLYlink written 9 weeks ago by Mensur Dlakic5.5k

In that case, you should not be so confident saying "to convert your data so it has a normal distribution"

ADD REPLYlink written 9 weeks ago by RamRS27k

Appreciate the suggestion.

ADD REPLYlink written 9 weeks ago by Mensur Dlakic5.5k

Thanks @ATpoint. The data is RNAseq. I got read count from someone else but I think this is the approach that has been used to get the read count from fastq data:

Fastqs were quality assessed by fastqc, then adaptors and sequences below a Phred score of 20 were trimmed using trimgalore. Reads were aligned to mm10 version 98 using HISAT2, then gene level counts were determined from exons using featureCounts with the –primary (counts primary alignments only), -p (counts fragments instead of reads) and –C (discards read pairs mapping to different chromosomes) options.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Raheleh150
1

That all sounds reasonable. As said it is difficult to tell from remote. I would stick with the standard workflow including normalization which is well-tested and accepted, so basically running DESeq().

ADD REPLYlink written 9 weeks ago by ATpoint35k
1
gravatar for Mensur Dlakic
9 weeks ago by
Mensur Dlakic5.5k
USA
Mensur Dlakic5.5k wrote:

The purpose of normalization is to convert your data so it has a normal distribution. If your data already has a normal distribution, that transformation will do nothing to it. Apparently, your original data is almost in that category.

why not only do log2 transformation??

It is always a good idea to do normalization because you will not always have a normal distribution in your data. Even in the case like this when the final result is near-identical, you should still use normalized + log2-transformed data.

ADD COMMENTlink written 9 weeks ago by Mensur Dlakic5.5k
5

Normalization is not transformation, its purpose is not to make the data (more or less) normally distributed. Of the common RNA-seq packages, only voom results in a transformation to a normalish distribution, the others keep the data as is and use normalization to compute per-sample offsets for their GLMs.

ADD REPLYlink written 9 weeks ago by Devon Ryan95k
1

Thank you. Appreciate an additional explanation.

ADD REPLYlink written 9 weeks ago by Mensur Dlakic5.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1772 users visited in the last hour