How to normalize count data for PCA in R - something goes wrong
1
0
Entering edit mode
3.4 years ago
m.laarman ▴ 10

Hi all! I've been trying to do PCA for my RNAseq data to examine the relationship between my samples. I have been asking questions under a different post and Kevin Blighe suggested I start a new post, so here it is.

I have RNAseq data from 3 types of tissues, one disease tissue (n=44) and two potential control tissues (n=16 and n=4). I want to do some analysis, and one of which is PCA to get some info on the relationship between the groups and to see if one of the two potential controls is more similar to the disease tissue in gene expression.

So, I have raw readcounts for these 64 samples.

Kevin Blighe suggested using DESeq2 and I've used the following code:

dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ batch + condition)
dds <- DESeq(dds)

normalized_counts <- counts(dds, normalized=TRUE)

vsd <- vst(dds, blind=TRUE)
assay(vsd)


In this code I used 'batch' because I have also some other samples from other studies, that were therefore not part of the same experiment and I thought that a factor batch (defined in coldata) could help to remove some of the effect of these different experiments. Please tell me if this is wrong.

Then I put the vsd data in a data frame to be able to generate histograms of each column (sample):

counts_norm_logged <- as.data.frame(assay(vsd))
hist(counts_norm_logged[,1])


I used a loop to generate histograms for all samples, and the distribution looks very similar between all these samples. This is an example:

So, there is a very large bar that represents all the genes that are not expressed in the sample, because their normalized logged count becomes a value around 4.

If I zoom in by cutting the y-axis at 10000, you get this image:

So, something is not really right here since my data should be more or less normally distributed now, after the normalization and logging.

There is one thing that might be related to this that I've wondered about for a while, since someone else mapped my data. The count data that I received back contains 63677 genes, while datasets I've worked with before contained 45274 genes. I've now googled it and I think that the human genome is not supposed to have that many genes...? Or am I mistaken? UPDATE: I've run the same code on some old datasets with 45274 genes and it gives a similar distribution, just the first bar is lower because it contains less genes (so I do think that the additional 20000 genes (almost) all have zero counts). Here is an example of these data with the same code:

That said, it might not be clear that I have not removed lowly expressed genes from these datasets. Should I have done this? And if so, what would be a good method and cut-off for this, since I've struggled with this before since it seems so arbitrary, at least to me, a person who is not to knowledgeable in this field.

Thank you in advance! I really hope someone can help me with this!

RNA-Seq R • 4.4k views
0
Entering edit mode

PCA usually benefit from first z-scoring, which only make sense if you a normal distribution for the expression data. Typically take a simple cut off at 1 will do, as it is predominately the 0s that you don't want. Gene pattern recommended 1 read per million for the gene over n samples as the threshold. Hannun from WCGNA wrote something about filtering on mean/std. Usually, after a log transforms on the expression level distribution, it becomes bimodal, which is unclear in your case. Can you plot the histograms with more bins?

Anyhow, if you are only interested in generating a PCA plot, just take the genes with highest STD for now, and see if the signal is there. Forget about the batch for now. The PCA will show if it is there.

1. List item

0
Entering edit mode

Hi btsui, Thanks for you reply!

First of all, I've generated a few histograms with more bins (these are made from the whole datasets of 64 samples together, as suggested by Kevin below, I can also make them for a single sample if that is what you were looking for):

Histogram with the original number of bins, but with all 64 samples together:

Twice as many bins:

Five times as many bins:

Then I have a few questions: What is and how do I do z-scoring?

And a cut-off of 1, that is then based on a single sample I guess? And is this based on raw counts or something else?

I'm also trying to filter out the highest STD genes as you suggested, using the following code with an totally arbitrary cut-off of 10:

keep <- rowSds(counts(dds)) >= 10
dds_STD10 <- dds[keep,]


Of the 63677 original genes, I am then left with 15894 genes. Would you say that the way I do this is correct?

Then, my histogram looks like this:

I've generated a PCA on the data as is, with the histogram that I've provided before, and I actually think it already looks nice, but now I want it to be solid of course, based on actually normallly distributed data.

I've generated three PCA plots (plotPCA(vsd,intgroup="condition")) with their own histograms, 1 based on all the genes, 1 based on the genes for which rowSums>=10 and one for which rowSds>=10:

all genes (63677 genes):

rowSums>=10 (31748 genes):

rowSds>=10 (15894 genes):

All three PCA plots look very very similar I would say (there are tiny difference, but the shape is very similar). I guess this makes sense, since PCA looks at the genes with the highest STD anyway, and ignores the genes that are lowly expressed in all samples and the genes that have a low STD? And I'm expecting the groups to be very similar also, so I guess it makes sense that the PCA plot looks like this.