Question: How to normalize count data for PCA in R - something goes wrong
gravatar for m.laarman
8 months ago by
m.laarman10 wrote:

Hi all! I've been trying to do PCA for my RNAseq data to examine the relationship between my samples. I have been asking questions under a different post and Kevin Blighe suggested I start a new post, so here it is.

I have RNAseq data from 3 types of tissues, one disease tissue (n=44) and two potential control tissues (n=16 and n=4). I want to do some analysis, and one of which is PCA to get some info on the relationship between the groups and to see if one of the two potential controls is more similar to the disease tissue in gene expression.

So, I have raw readcounts for these 64 samples.

Kevin Blighe suggested using DESeq2 and I've used the following code:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ batch + condition)
dds <- DESeq(dds)

normalized_counts <- counts(dds, normalized=TRUE)

vsd <- vst(dds, blind=TRUE)

In this code I used 'batch' because I have also some other samples from other studies, that were therefore not part of the same experiment and I thought that a factor batch (defined in coldata) could help to remove some of the effect of these different experiments. Please tell me if this is wrong.

Then I put the vsd data in a data frame to be able to generate histograms of each column (sample):

counts_norm_logged <-

I used a loop to generate histograms for all samples, and the distribution looks very similar between all these samples. This is an example:

histogram normalized logged counts of sample 1

So, there is a very large bar that represents all the genes that are not expressed in the sample, because their normalized logged count becomes a value around 4.

If I zoom in by cutting the y-axis at 10000, you get this image:

same histogram, y-axis cut at 10000

So, something is not really right here since my data should be more or less normally distributed now, after the normalization and logging.

There is one thing that might be related to this that I've wondered about for a while, since someone else mapped my data. The count data that I received back contains 63677 genes, while datasets I've worked with before contained 45274 genes. I've now googled it and I think that the human genome is not supposed to have that many genes...? Or am I mistaken? UPDATE: I've run the same code on some old datasets with 45274 genes and it gives a similar distribution, just the first bar is lower because it contains less genes (so I do think that the additional 20000 genes (almost) all have zero counts). Here is an example of these data with the same code:

example distribution normalized logged counts 'old' data

That said, it might not be clear that I have not removed lowly expressed genes from these datasets. Should I have done this? And if so, what would be a good method and cut-off for this, since I've struggled with this before since it seems so arbitrary, at least to me, a person who is not to knowledgeable in this field.

Thank you in advance! I really hope someone can help me with this!

rna-seq R • 1.0k views
ADD COMMENTlink modified 8 months ago • written 8 months ago by m.laarman10

PCA usually benefit from first z-scoring, which only make sense if you a normal distribution for the expression data. Typically take a simple cut off at 1 will do, as it is predominately the 0s that you don't want. Gene pattern recommended 1 read per million for the gene over n samples as the threshold. Hannun from WCGNA wrote something about filtering on mean/std. Usually, after a log transforms on the expression level distribution, it becomes bimodal, which is unclear in your case. Can you plot the histograms with more bins?

Anyhow, if you are only interested in generating a PCA plot, just take the genes with highest STD for now, and see if the signal is there. Forget about the batch for now. The PCA will show if it is there.

  1. List item

ADD REPLYlink written 8 months ago by btsui290

Hi btsui, Thanks for you reply!

First of all, I've generated a few histograms with more bins (these are made from the whole datasets of 64 samples together, as suggested by Kevin below, I can also make them for a single sample if that is what you were looking for):

Histogram with the original number of bins, but with all 64 samples together:

Twice as many bins:

Five times as many bins:

Then I have a few questions: What is and how do I do z-scoring?

And a cut-off of 1, that is then based on a single sample I guess? And is this based on raw counts or something else?

I'm also trying to filter out the highest STD genes as you suggested, using the following code with an totally arbitrary cut-off of 10:

keep <- rowSds(counts(dds)) >= 10
dds_STD10 <- dds[keep,]

Of the 63677 original genes, I am then left with 15894 genes. Would you say that the way I do this is correct?

Then, my histogram looks like this:

I've generated a PCA on the data as is, with the histogram that I've provided before, and I actually think it already looks nice, but now I want it to be solid of course, based on actually normallly distributed data.

I've generated three PCA plots (plotPCA(vsd,intgroup="condition")) with their own histograms, 1 based on all the genes, 1 based on the genes for which rowSums>=10 and one for which rowSds>=10:

all genes (63677 genes):

rowSums>=10 (31748 genes):

rowSds>=10 (15894 genes):

All three PCA plots look very very similar I would say (there are tiny difference, but the shape is very similar). I guess this makes sense, since PCA looks at the genes with the highest STD anyway, and ignores the genes that are lowly expressed in all samples and the genes that have a low STD? And I'm expecting the groups to be very similar also, so I guess it makes sense that the PCA plot looks like this.

Thanks in advance!

ADD REPLYlink modified 8 months ago by RamRS21k • written 8 months ago by m.laarman10
gravatar for Kevin Blighe
8 months ago by
Kevin Blighe41k
Guy's Hospital, London
Kevin Blighe41k wrote:

Hey, sorry for getting here late. btsui's comments are valid - you should take them on board and explore.

I just looked through your answer and the following lines struck me:

counts_norm_logged <-

The output of the vst() function is not strictly logged data (as far as I know); rather, it's a transformation based on the dispersion, which is inherently based on variance / covariance. Also, by generating a histogram on just counts_norm_logged[,1], you are only plotting data for the first sample. You should try the following and plot the histogram for the entire dataset combined:

counts_norm_vsd <- assay(vsd)

How does it look after that?


Regarding the number of transcripts, the ~60,000 mark would tell me that you've got both coding and non-coding transcripts in your dataset. When the GENCODE project was completed, upward of 200,000 human transcript isoforms had been identified, covering ~60,000 individual genes. These are then grouped into what have become known as 'biotypes'. You can see a full list here:

This makes for RNA-seq experiments to be really interesting; however, in any particular experiment, many of these transcripts will not be expressed in the tissue of interest. These 'zero count' transcripts should ideally be filtered out before you even run DESeq2.

ADD COMMENTlink modified 8 months ago • written 8 months ago by Kevin Blighe41k

I'm now running code to filter my data with different cut-offs and generating new histograms, however, while going through the code again I noticed something:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ batch + condition)
dds <- DESeq(dds)

normalized_counts <- counts(dds, normalized=TRUE)

vsd <- vst(dds, blind=TRUE)

counts_norm_logged <-

Now, aside from the fact that I've called them logged counts, I don't see how I'm actually using the normalized counts here to for my histogram...: I create a new table containing the normalized counts (normalized_counts) and then I run the vst function, but I run it on dds, not on the table of normalized counts. And then I make a histograms based on the result from the vst function, that, as far as I'm aware doesn't contain the normalized counts.

Is this correct?

And if so, should I do it as follows:

dds <- counts(dds, normalized=TRUE)
vsd <- vst(dds, blind=TRUE)

This doesn't work, give me the error: Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers

or should I do the vst function on the normalized_counts table:

vsd <- vst(normalized_counts, blind=TRUE)

Here I obviously get the same error, since I'm basically only changing the name of the thing.

I'm sure I'm misunderstanding something somewhere, so I hope you will be able to help me clear it up :) Thanks!!

ADD REPLYlink written 8 months ago by m.laarman10

The correct approach is to perform vst on the DESeq object (although it can also be performed on integer data). In your code:

dds <- counts(dds, normalized=TRUE)
vsd <- vst(dds, blind=TRUE), you actually overwrite your DESeq object, dds, with the first line, and then try to run vst on what is then a simple count matrix with no relation to DESeq. You just need to run vsd <- vst(dds, blind=TRUE) without the first line.

The DESeq2 vignette goes over this, Extracting transformed values .

The only major issue that I had with your first histograms is that you were only plotting for a single sample with hist(counts_norm_logged[,1]). The new plots (including the PCA bi-plot) that you've posted above look fine. Try also a simple boxplot(), and generally, the other things mentioned here: Data quality assessment by sample clustering and visualization.

ADD REPLYlink written 8 months ago by Kevin Blighe41k

Hi Kevin, Thanks for your reply.

So, by running

vsd <- vst(dds, blind=TRUE)

It automatically also normalizes the data? Or what is the

normalized_counts <- counts(dds, normalized=TRUE)

For then? Or is this last line just to extract the normalized counts out of the DESeqDataSet? And the vst function knows to use the normalized data instead of the raw data? Is that how I'm supposed to understand it? :)

Then, what you say about the new histograms above, in response to btsui. The first few histograms, without the filtering of genes, I guess they don't look good, right? I mean, the massive peak for unexpressed genes I guess is making the distribution not normal. So, do I have to filter them out until this peak is completely gone, or is there some intermediate (I don't want to filter out genes that might be relevant because they are somewhat expressed in some of the samples)?

I also would love to do some Pearson correlation on these data. For that I also have to use the normalized counts, right? The same data as I used for the histograms? And I would also then need the same kind of filtering for not/lowly expressed genes, right? Are there some additional tips you could give me for the Pearson correlation? Because so far I've noticed that if I run the Pearson correlation on differently (wrongly) normalized and logged data, even tissues that are not supposed to be anything alike (for instance brain and liver) have a high correlation coefficient, so I'm wondering if I'm actually doing it right... Or should I filter on high STD genes for this or something?

How do I extract the right data from the DESeqDataSet to use for Pearson correlation? Is it one of the below options or something different?

normalized_counts <- counts(dds, normalized=TRUE)
counts_norm <-

If I used counts_norm, then basically all Pearson correlation coefficients (also between tissues like brain and liver) are basically between 0.9 and 1.

Thanks again in advance!!

ADD REPLYlink written 8 months ago by m.laarman10

It automatically also normalizes the data?

Of course, the authors know what they're doing, all of the defaults are appropriate for the overwhelming majority of use-cases. Regarding counts(dds, normalized=TRUE), it's for getting normalized counts. Sometimes that's nice for plotting or people like to look at them to get a feel for the data.

The histograms you posted look fine. You are vastly over thinking and worrying about this.

I also would love to do some Pearson correlation on these data. For that I also have to use the normalized counts, right?

This is covered with a very nice heatmap example in the DESeq2 vignette. Read the vignette and reproduce what's in it.

ADD REPLYlink modified 8 months ago • written 8 months ago by Devon Ryan89k

Thanks for your reply. I do tend to overthink things ;)

I'll try the code in the vignette.

ADD REPLYlink modified 8 months ago • written 8 months ago by m.laarman10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1783 users visited in the last hour