Question: input counts to PCA in RNAseq
3.3 years ago by
elizabethR70 wrote:


I would like to plot RNA-seq data that I have downloaded from TCGA in a PCA plot. I have found some great guides on how to draw the PCA itself using R and ggplot2, but my main question is: what format of data should I plot?

I currently have raw counts and RSEM data. Should I input raw counts into something like edgeR or DESeq2 and filter for expression by CPM first? Should I normalise it? Should I stabilise the variance using rlog? Or convert to TPM and plot that? Argh, I'm so confused. Grateful for any advice you can give me :)

rna-seq pca ggplot2 R • 5.3k views
ADD COMMENTlink modified 22 months ago by Kevin Blighe69k • written 3.3 years ago by elizabethR70

I am not familiar with DESeq2; I have been using edgeR up to now. I have just been reading the manuals and online tutorials and looking at how to input the data. I see that it will accept a count matrix such as the one I have in CSV format, but that it also needs a metadata file. I am really not sure how to make one of these or what it must contain. Can anyone advise on how I can do this?

Thanks again in advance

ADD REPLYlink written 3.3 years ago by elizabethR70

As an alternative to PCA, you can also try MDS plots, although they should give similar results.
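To illustrate why the two approaches tend to agree, here is a minimal base R sketch. It assumes classical MDS (`cmdscale`) on plain Euclidean distances, in which case the sample coordinates match the PCA scores up to the sign of each axis. The `expr` matrix is made-up toy data, not TCGA values, and MDS implementations that use a different distance will give similar rather than identical results.

```r
# Toy log-scale expression matrix: 5 samples (rows) x 4 genes (columns).
# All values are invented for illustration only.
expr <- matrix(c(5.1, 7.2, 3.3, 9.0,
                 5.0, 7.0, 3.1, 9.2,
                 6.8, 4.1, 3.2, 8.9,
                 7.0, 4.0, 3.4, 9.1,
                 6.0, 5.5, 3.0, 9.0),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("S", 1:5), paste0("gene", 1:4)))

# PCA on the samples (genes are centred but not scaled)
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Classical MDS on Euclidean distances between the same samples
mds <- cmdscale(dist(expr), k = 2)

# Compare the first two axes, ignoring the arbitrary sign of each axis
max_diff <- max(abs(abs(pca$x[, 1:2]) - abs(mds)))
print(max_diff)
```

If `max_diff` is close to zero, the two plots show the same sample layout, possibly mirrored.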

ADD REPLYlink written 3.3 years ago by Corentin450

Hello, I am updating this topic because I have another question.

rlogTransformation is fine and I have applied it to my data, but does DESeq2 account for the difference in the number of reads between samples?

I have 3 samples from each of 2 conditions: the 3 samples from the first condition all have ~10,000,000 reads and the 3 others have ~13,000,000, so I don't know whether the clusters on my PCA come from biological differences (which I hope, haha) or from the difference in the number of reads.

ADD REPLYlink written 23 months ago by darbinator230

I moved that one to a comment. Please open a new thread for such questions instead of reviving older ones. Still, this question has been asked before; please use the search function and Google. From the manual, which you should always read first:


This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. The rlog transformation produces a similar variance stabilizing effect as varianceStabilizingTransformation, though rlog is more robust in the case when the size factors vary widely. The transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis. rlog takes as input a DESeqDataSet and returns a RangedSummarizedExperiment object.

So yes, it normalizes and is a recommended transformation for downstream applications such as PCA and clustering.
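As a rough illustration of what normalizing for library size means here, the following base R sketch computes median-of-ratios size factors by hand, which is the idea behind DESeq2's size factor estimation. The toy counts are an assumption made for this example: two groups of three samples with identical expression profiles that differ only by a 1.3-fold sequencing depth, mimicking the ~10M versus ~13M reads in the question. This is not the DESeq2 code itself.

```r
# Toy data: the same expression profile sequenced at two depths.
# All numbers are invented for illustration.
base_profile <- c(gene1 = 100, gene2 = 200, gene3 = 50, gene4 = 1000)
depth <- c(A1 = 1, A2 = 1, A3 = 1, B1 = 1.3, B2 = 1.3, B3 = 1.3)
counts <- outer(base_profile, depth)  # genes x samples

# Median-of-ratios size factors:
# 1. per-gene geometric mean across samples (computed on the log scale)
# 2. per-sample median of the gene-wise ratios to that geometric mean
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_means))
})

# Divide each sample (column) by its size factor
normalized <- sweep(counts, 2, size_factors, "/")

print(round(size_factors, 3))
print(round(normalized, 1))
```

In this toy case the depth difference disappears after normalization, so any remaining separation between groups in a PCA would have to come from the expression profiles themselves.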

ADD REPLYlink modified 23 months ago • written 23 months ago by ATpoint44k
3.3 years ago by
plat50 wrote:

I would recommend inputting your raw counts into DESeq2, running the pipeline, converting the normalized counts to rlog values (regularized log-transformed counts), and then running the plotPCA function from DESeq2. It is very easy once you are familiar with the package.

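library(DESeq2)

# Create the DESeq2 object. Placeholders to define beforehand:
#   inputData: raw count matrix (genes x samples)
#   samples:   metadata data frame with one row per sample
#   design:    a formula such as ~ condition
dds <- DESeqDataSetFromMatrix(countData = inputData,
                              colData = samples,
                              design = design)

# Run the DESeq2 pipeline (betaPrior is a logical chosen by the user)
dds <- DESeq(dds, betaPrior = betaPrior)

# Regularized log transformation for downstream analyses (clustering, heatmaps, etc.)
rld <- rlogTransformation(dds)

# colGroups: character vector of colData column names used to group samples
pca <- plotPCA(rld, intgroup = colGroups)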

The idea behind using the rlog transformation for quality control checks is described in the DESeq2 paper: "[...] Therefore, we use the shrinkage approach of DESeq2 to implement a regularized logarithm transformation (rlog), which behaves similarly to a log2 transformation for genes with high counts, while shrinking together the values for different samples for genes with low counts. It therefore avoids a commonly observed property of the standard logarithm transformation, the spreading apart of data for genes with low counts, where random noise is likely to dominate any biologically meaningful signal[...]"
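The "spreading apart" of low counts under a plain logarithm can be seen in a small base R simulation. This is only a sketch under stated assumptions: counts are drawn from Poisson distributions with arbitrarily chosen means of 5 and 5000, and a pseudocount of 1 is added before taking log2. It does not run rlog itself.

```r
set.seed(1)
n <- 1000

# Pure sampling noise around a low and a high expression level
# (the means 5 and 5000 are arbitrary choices for illustration)
low  <- rpois(n, lambda = 5)
high <- rpois(n, lambda = 5000)

# Spread of the values after a standard log2(count + 1) transformation
sd_low  <- sd(log2(low + 1))
sd_high <- sd(log2(high + 1))

print(c(low = sd_low, high = sd_high))
```

The low-count gene varies far more on the log scale even though no biological difference was simulated, which is the noise that rlog is designed to shrink.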

ADD COMMENTlink written 3.3 years ago by plat50
3.3 years ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

Dear Elizabeth,

In the simplest scenario (4 samples; 4 genes; 1 experimental condition), your metadata object, which you may have to read in from a file, could look like:

ID  Condition

...whilst your counts file could look like:

        MA  MB  PA  PB
gene1   45  46  25  22
gene2   45  45  45  44
gene3   10  10   9   4
gene4   88  67  34  44

This could then be read into DESeq2 as:

dds <- DESeqDataSetFromMatrix(rawcounts, colData=metadata, design=~Condition)
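As a sketch of how the two objects could be built and checked in base R before calling DESeq2: the count values are taken from the table above, but the condition labels "M" and "P" are an assumption guessed from the sample names, since the metadata rows are not shown in this thread.

```r
# Count matrix from the example above (genes x samples)
rawcounts <- matrix(c(45, 46, 25, 22,
                      45, 45, 45, 44,
                      10, 10,  9,  4,
                      88, 67, 34, 44),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("gene", 1:4),
                                    c("MA", "MB", "PA", "PB")))

# Metadata: one row per sample, in the same order as the count columns.
# The labels "M" and "P" are hypothetical, inferred from the sample names.
metadata <- data.frame(ID = colnames(rawcounts),
                       Condition = factor(c("M", "M", "P", "P")),
                       row.names = colnames(rawcounts))

# DESeq2 expects the rows of colData to match the columns of the counts
stopifnot(identical(rownames(metadata), colnames(rawcounts)))

print(metadata)
```

When the metadata comes from a file instead, the same check is worth running after reading it in, because a mismatch in sample order would silently mislabel samples in the PCA.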
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Kevin Blighe69k
22 months ago by
-_-870 wrote:

I built an app to facilitate visualization of TCGA RNA-Seq data; it may be helpful for similar use cases.

ADD COMMENTlink written 22 months ago by -_-870