Hi all! I've been trying to do PCA for my RNAseq data to examine the relationship between my samples. I have been asking questions under a different post and Kevin Blighe suggested I start a new post, so here it is.
I have RNAseq data from 3 types of tissues, one disease tissue (n=44) and two potential control tissues (n=16 and n=4). I want to do some analysis, and one of which is PCA to get some info on the relationship between the groups and to see if one of the two potential controls is more similar to the disease tissue in gene expression.
So, I have raw readcounts for these 64 samples.
Kevin Blighe suggested using DESeq2 and I've used the following code:
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ batch + condition) dds <- DESeq(dds) normalized_counts <- counts(dds, normalized=TRUE) vsd <- vst(dds, blind=TRUE) assay(vsd)
In this code I used 'batch' because I have also some other samples from other studies, that were therefore not part of the same experiment and I thought that a factor batch (defined in coldata) could help to remove some of the effect of these different experiments. Please tell me if this is wrong.
Then I put the vsd data in a data frame to be able to generate histograms of each column (sample):
counts_norm_logged <- as.data.frame(assay(vsd)) hist(counts_norm_logged[,1])
I used a loop to generate histograms for all samples, and the distribution looks very similar between all these samples. This is an example:
So, there is a very large bar that represents all the genes that are not expressed in the sample, because their normalized logged count becomes a value around 4.
If I zoom in by cutting the y-axis at 10000, you get this image:
So, something is not really right here since my data should be more or less normally distributed now, after the normalization and logging.
There is one thing that might be related to this that I've wondered about for a while, since someone else mapped my data. The count data that I received back contains 63677 genes, while datasets I've worked with before contained 45274 genes. I've now googled it and I think that the human genome is not supposed to have that many genes...? Or am I mistaken? UPDATE: I've run the same code on some old datasets with 45274 genes and it gives a similar distribution, just the first bar is lower because it contains less genes (so I do think that the additional 20000 genes (almost) all have zero counts). Here is an example of these data with the same code:
That said, it might not be clear that I have not removed lowly expressed genes from these datasets. Should I have done this? And if so, what would be a good method and cut-off for this, since I've struggled with this before since it seems so arbitrary, at least to me, a person who is not to knowledgeable in this field.
Thank you in advance! I really hope someone can help me with this!