Scale and Center [normalized] RNA-seq expression counts for PCA ?
1
1
Entering edit mode
6.7 years ago
gaelgarcia ▴ 200

I have a dataset of hundreds of different samples and their normalized (by library size) feature counts.

I want to perform downstream analysis on these data, starting with PCA using prcomp. Should I center and scale the values before PCA, or is the normalization of reads enough?

Thanks!

RNA-Seq pca R sequencing genome • 10k views
1
Entering edit mode

You can use log transformed values and also add pseudocounts  to reduce the bias towards highly expressed transcripts. Always higher values (expression) dominates the variation levels between the samples than the lower values (or less expressed transcripts). If you do not use pseudocount the results will be completely based on highly expressed transcripts. This plays a major role when you do analysis on less expressed transcripts especially the non-coding RNAs along with protein coding ones.

0
Entering edit mode

Thanks. I wasn't referring to log-transforming, but to "scaling" and "centering" which are standarization options when performing PCA. Since I am not visualizing the expression counts, but rather the samples and their coordinates in PCA space, I don't think it makes any difference if I log-transform the data in this case.

0
Entering edit mode

I could not able to explain you properly but you will get better explanation here :

0
Entering edit mode

Helloo Kevin,

I really like your way of explaining and clarifying concepts. So i thought to ask you only about my issue, can you please respond me in your busy schedule Thanks .

My question is : I have 57 tumor samples, and 3 healthy samples and 3 carrier samples (as condition in Deseq colData) . I have created VSD by using vst function on DEseq object and tried to find sample mixup by using PCA .

After running PlotPCA and Pheatmap , i got PCA plot and heatmap (attached), and it showed mixup of tumor and healthy samples.

I was wondering How can i correct it m so that samples visible seperated (clustered) on PCA plot ? Thanks.

0
Entering edit mode

Hello kevin, here are link for PCA and heamap plots:

PCA- https://ibb.co/YhXKnL6 Heatmap - https://ibb.co/86N4kbb

0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

0
Entering edit mode

I apolozise Kevin, as i'm first time user on this Biostar, i didn't know about it.

0
Entering edit mode
3.2 years ago

That option of prcomp just helps to 'iron out' the 'bumps' in the data. If you don't perform it, you may very well observe PC1 representing more variance in the dataset than it normally would. This is why we would typically perform PCA on logged counts in the first place, i.e., because the distributon of logged data is more 'smooth' than that of unlogged.

I have written more about PCA here:

Kevin

0
Entering edit mode

Hello Kevin, I took logtransformed count data and performed PCA by providing Condition ATL, Healthy and carrier. Still PCA showed mixup of Healthy with ATL . Whether i need to dicard Healthy sample ? ,,, I dont want to discard healthy samples as i have only 3 normal samples and I need to do Differential expression using DESEq with contrast ATL vs Healthy . Can you suggest me how can i proceed , a clear idea .. Thank you !

0
Entering edit mode

Hi, it is not necessarily a bad thing for your sample groups to mix on a PCA bi-plot. In fact, it may very well be a positive sign that indicates your dataset is ready for further analysis. Unfortunately, I cannot make any definitive comments on your data without seeing relevant code, the actual bi-plot (with explained variation for each PC being plot), and, generally, understanding more about your project.

0
Entering edit mode

Hi again. I now see your other comment (above). I would remove the 2 samples at the right, and then re-process the data.

Your dataset is heavily imbalanced, by the way, and this will be reflected by 'crazy' p-values when you proceed to the differential expression analysis - wait and see.

0
Entering edit mode

Dear Kevin,

Please find below R code I have 57 ATL samples, 3 carrier and 3 Normal ( Do i need to provide Sample Data matrix , just worrying about data security , so not sending).... I need to to Differential expression between ATL vs Normal Please find the attached biplot and scatterplot3d for reference . Green color is ATL, Red is carrier and yellow is Normal. Can you please help me on following issues: 1) In PCA Normal samples mixed up in ATL , Whether can i proceed for Differential expression using these (even they mixup) DESEq and Enhanced plot for visualization ? or is there any serious issue ? 2) If there is issue m can you help me in soving this issue. 3) Whether i need to remove 2 ATL samples at right ? 4) How to Normalize data , So i can use that data for differential expression ? 5) You said my data is heavily imbalanced, how to should i balance it. 6) In Enhanced plot, what is the value i need to put at pCutoff = , so i set 0.05 ?

I really sorry for so many questions, i trust your responsem so asked ... Thanks Below im sending R code and plots

0
Entering edit mode

# -------------Read count file and column data file ------------

library(dplyr) countdata_final= distinct(countdata, Ensemble.ID, .keep_all = TRUE)

countdata_final_log= log2(countdata_final+1)

write.csv(countdata_final, "HTLV_countdata.csv")

write.csv(countdata_final_log, "HTLV_countdata_log.csv")

rownames(countdata_final)= countdata_final$Ensemble.ID abline(h=median(as.matrix(logcount_data)), col="blue") columndata=read.csv("column_data.csv", header = T) # Setting up the DESeqDataSet Object and run the DESeq pipeline library(DESeq2) HTLV_dds_object = DESeqDataSetFromMatrix(countData=countdata_final, colData=columndata, design=~condition) HTLV_dds_object$condition= relevel(HTLV_dds_object$condition, "Healthy_Volunteer") HTLV_dds = DESeq(HTLV_dds_object) # --------------------PCA Analysis ------------------------------------- counts.sf_normalized = counts(HTLV_dds , normalized = TRUE) colnames(counts.sf_normalized) = c(rep("ATL", 57), rep("Carrier", 3), rep("healthy", 3)) type <- list( rownames(t(counts.sf_normalized))) type <- gsub("[0-9]*$", "", rownames(t(counts.sf_normalized))) type

type <- factor(type, levels = c("ATL","Carrier","healthy")) type

colType <- c("forestgreen", "red2", "yellow")[type] colType

library(PCAtools)

project.pca <- prcomp(t(counts.sf_normalized)) summary(project.pca)

project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100)) par(cex=1.0, cex.axis=0.8, cex.main=0.8) pairs(project.pca$x[,1:5], col=colType, main="Principal components analysis bi-plot\nPCs 1-5", pch=16) pairs(project.pca$x[,6:10], col=colType, main="Principal components analysis bi-plot\nPCs 6-10", pch=16) plot(project.pca$x, type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%")) points(project.pca$x, col=colType, pch=16, cex=1) require(scatterplot3d) par(mar=c(4,4,4,4), cex=1.0, cex.main=0.8, cex.axis=0.8) scatterplot3d(project.pca$x[,1:3], angle=-40, main="", color=colType, pch=17, xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), zlab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"), grid=FALSE, box=FALSE)

<h6>another PCA way</h6>

VSD = vst(HTLV_dds, blind = FALSE)

library(ggrepel)

VSD2= log(VSD)

plotPCA(VSD2, "condition")

<h6>another PCA way</h6>

# -----------PCA ----------------------------

resultsNames(HTLV_dds)

ATL_Normal_res= results(HTLV_dds, contrast = c("condition", "ATL", "Healthy_Volunteer"))

# -----------------Now lets considered ATL vs Normal ---------------------------

ATL_Normal_res= results(dds, contrast = c("condition", "ATL", "Healthy_Volunteer"))

ix = which.min(ATL_Normal_res$padj) # most significant ATL_Normal_res <-ATL_Normal_res[order(ATL_Normal_res$padj),] # sort library(knitr) kable(ATL_Normal_res[1:5,-(3:4)])

barplot(assay(HTLV_dds_object)[ix,],las=2, main=rownames(HTLV_dds_object)[ ix ] )

# Annotatae now with Annotationdbi

library("AnnotationDbi") library("org.Hs.eg.db") columns(org.Hs.eg.db)

ATL_Normal_res$symbol = mapIds(org.Hs.eg.db, keys=row.names(ATL_Normal_res), column="SYMBOL", keytype="ENSEMBL", multiVals="first") ATL_Normal_res$entrez = mapIds(org.Hs.eg.db, keys=row.names(ATL_Normal_res), column="ENTREZID", keytype="ENSEMBL", multiVals="first") ATL_Normal_res$name = mapIds(org.Hs.eg.db, keys=row.names(ATL_Normal_res), column="GENENAME", keytype="ENSEMBL", multiVals="first") library(EnhancedVolcano) write.csv(ATL_Normal_res, file = "ATL_Normal_result_DEG.csv") ATL_Normal_res_DEG= as.data.frame(ATL_Normal_res) ATL_Normal_res_DEG_final= distinct(ATL_Normal_res_DEG, symbol, .keep_all = TRUE) ATL_Normal_res_DEG_final=ATL_Normal_res_DEG_final %>% drop_na(symbol) rownames(ATL_Normal_res_DEG_final)= ATL_Normal_res_DEG_final$symbol

EnhancedVolcano(ATL_Normal_res_DEG_final, lab = rownames(ATL_Normal_res_DEG_final), x = 'log2FoldChange', y = 'pvalue', pCutoff = 0.05, FCcutoff = 2)

0
Entering edit mode

0
Entering edit mode
0
Entering edit mode

Hi! Just some points:

## [a]

When you load PCAtools via library(PCAtools), you are then not using the functions from the package; instead, you are using prcomp() and base R plotting functions - you are probably following my other Biostars post? The actual PCAtools functions are elaborated here: http://www.bioconductor.org/packages/release/bioc/vignettes/PCAtools/inst/doc/PCAtools.html

Also, for input to all PCA functions, I would use assay(VSD), not counts.sf_normalized

## [b]

in your 'another PCA way' section, you do not need to log VSD via:

VSD2= log(VSD)


You should just use assay(VSD) without any further transformations.

## [c]

When generating a results table, you should be using log fold-change shrinkage (lfcShrink()). Take a look here: Quick start.

In your case, there should be a coefficient called 'condition_ATL_vs_Healthy_Volunteer', which you can use.

## [d]

There is still evidence of an outlier in your PCA bi-plots along PC1. I would remove that sample.

## [e]

Always check and double-check that your input objects are perfectly aligned in name and order, i.e.,:

• countdata_final
• columndata

Also, this should return TRUE:

all(rownames(columndata) == colnames(countdata_final))

0
Entering edit mode

Dear Kevin, Thank you for your time !

I tried PCAtools but getting error after running screeplot or biplot on p. I also tried your documentation example and same error i am getting, below is code and error.

library(PCAtools)

resultsNames(HTLV_dds)

HTLV_dds
columndata
countdata_final
all(rownames(columndata) == colnames(countdata_final))
HTLV_VSD = vst(HTLV_dds, blind = FALSE)

count_vsd = assay(HTLV_VSD)

all(rownames(columndata) == colnames(count_vsd))

p <- pca(count_vsd, metadata=columndata, removeVar = 0.1)

screeplot(p)

biplot(p)
#---------Error ----------------------------------------------

> biplot(p)
Error: $operator not defined for this S4 class #---------Error ----------------------------------------------  I can remove ATL samples (1 or 2 ) but dont want to remove Normal samples (as only 3 samples) to do DIfferential expression. Can you suggest me , can i proceed to do DESEq using same data(just by removing right side 2 outlier samples) ? ADD REPLY 0 Entering edit mode Regarding PCAtools, there could be a namespace issue. You could try: p <- PCAtools::pca(count_vsd, metadata=columndata, removeVar = 0.1) PCAtools::screeplot(p) PCAtools::biplot(p)  Can you suggest me , can i proceed to do DESEq using same data(just by removing right side 2 outlier samples) ? I suggest that you re-generate the PCA bi-plot to see if there still exists an outlier (or outliers). ADD REPLY 0 Entering edit mode Dear Kevin, As per your suggestion, i performed PCAtools , please find the attached biplot and other plots and links : # ---------------------Code ------------------------- library(PCAtools) columndata_pca= read.csv("column_data_forPCA.csv", header = T) rownames(columndata_pca) = columndata_pca$ID

rownames(countdata_PCA)= countdata_PCA$ID library(DESeq2) HTLV_dds_object = DESeqDataSetFromMatrix(countData=countdata_PCA, colData=columndata_pca, design=~condition) HTLV_dds_object$condition= relevel(HTLV_dds_object\$condition, "Healthy_Volunteer")

HTLV_dds = DESeq(HTLV_dds_object)

resultsNames(HTLV_dds)

HTLV_dds countdata_PCA columndata_pca

HTLV_VSD = vst(HTLV_dds, blind = FALSE)

count_vsd = assay(HTLV_VSD)

all(rownames(columndata_pca) == colnames(count_vsd))

p <- PCAtools::pca(count_vsd, metadata=columndata_pca, removeVar = 0.1)

PCAtools::screeplot(p) PCAtools::biplot(p) PCAtools::screeplot(p) # screeplot(p)

PCAtools::pairsplot(p)

# ---------------------Code -------------------------

It will be great help , if you let me know what should i do next, to do Deseq Differential expression (without removing Healthy samples) . plz . Thanks !

0
Entering edit mode

Hi, there is still clear evidence of 2 outlier samples:

• ATL21
• ATL48

You should remove these from the raw data stage, and then re-process the data.

0
Entering edit mode

Dear Kevin, I really appreciate your suggestions and time ... Thank you :)

I removed 2 outliers ... ATL21, and ATL48 and reploted Biplots again ( please find below)

1) Can you tell me , Whether I need to remove any samples? ... I don't want to remove (Healthy samples) . 2) Is there any way , That without removing any samples , can i proceed to do Differential expression using Deseq. 3) How can i plot 3D plot of PCA .

Heatmap - https://ibb.co/d4c0YhS .... Biplot - https://ibb.co/ydPY9BJ .... Pairsplot - https://ibb.co/HnmrN5g ... Screenplot - https://ibb.co/jDRmX8r ...

0
Entering edit mode

Data looks good now.

1
Entering edit mode

Kevin, You are Amazing :) .

Then i will proceed for DIfferential expression analysis .

Thank you for your time and help :) !

0
Entering edit mode

Sure thing - hope that it all works out!

0
Entering edit mode

Dear Kevin,

I have an issue with my current analysis and is related with the scale argument in the PCAtools package. In brief, I'm analyzing data from 6 distinct cell lines exposed to compound "B". I have three replicates for each experimental group (CT and Tx) for each cell line. After normalization in DESeq I saved the vst transformed counts to perform data exploration with PCAtools. However, when I set scale = F in the pca function and plot the results using the biplot my data is grouped according the cell line, i.e CT and Tx for each cell line cluster together and the separation of data is according the cell line. A similar result is observed by using plotPCA from DESeq2. However, when I set 'scale = T my data is grouped respect to the treatment (using PCA tools).

This is a snippet of my code:

dds <- DESeqDataSetFromMatrix(countData = my_counts, colData = metadata, design = ~cell + condition)
dds <- DESeq(dds)
norm_counts <- counts(dds, normalized = T)
vsd <- vst(dds, blind = T)



0
Entering edit mode

Hey, that makes sense because the DESeq2 plotPCA() function does not invoke scaling, either - please see https://support.bioconductor.org/p/100069/#100074

After you scale, I imagine that the percent variation on PC1 and/or PC2 changes a lot. Is it possible to share these PCA bi-plots before and after scaling?

0
Entering edit mode

Hello Kevin,

Thanks for your help and response. You are right, plotPCA from DESeq2 does not scale the data. I made a mistake in the previous message. I employed the rlogTransformation instead vst. Yeah, sure!, please find below the PCA bi-plots before and after scaling.

Best regards!

1
Entering edit mode

Gracias amigo! hmmm... Is there any other variable (e.g. batch) that could explain the separation in the scaled PCA bi-plot? Are you aware that DESeq2's function for PCA automatically removes a substantial proportion of variables prior to performing PCA?

0
Entering edit mode

¡Por nada amigo! Thanks to you Kevin. In the experimental design, we controlled or avoided batch effects. For example, in the bi-plot of scaled data, points of the same color (i.e the same cell line CT or Tx) belong to different sequencing runs. I have already reviewed the source code of plotPCA and as I understood it only takes the top 500 genes with highest variance as default. I guess that even after transforming the counts, there are high variance due to the type of cell line which could dominate over the variance due to treatment, so scaling prevents this result...?

Thanks again Kevin!

1
Entering edit mode

Si, there is also that aspect of the DESeq2 plotPCA() function, i.e., it automatically selects the top n genes based on variance. You can disable this, though.

The way that scaling alters a dataset is clear via this example:

x <- matrix(rexp(200, rate=.1), ncol=20)

# get variance of each column (sample)
apply(x, 2, var)
[1]  25.528056  34.523421  37.149345  88.592867 524.478997  30.349629  49.728655 205.190988 115.367114 107.676855  56.678798  79.674791   8.239216  58.639782
[15]  56.981929  85.714589 254.761586 117.739202 199.789468 101.299346

# get variance of column-scaled data
apply(scale(x, center = TRUE, scale = TRUE), 2, var)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


So, it results in the samples (columns) having unit variance.

For heatmaps, we'd usually scale by row, and have to transpose:

x <- matrix(rexp(200, rate=.1), ncol=20)

# get variance of each row (gene)
apply(x, 1, var)
[1]  68.13476  44.61792  84.84580  44.07503 106.90984 161.49444  49.83009 148.43526 350.87706  54.73157

# get variance of row-scaled data
apply(t(scale(t(x), center = TRUE, scale = TRUE)), 1, var)
[1] 1 1 1 1 1 1 1 1 1 1


Now genes (rows) have unit variance.

It's difficult to say what is happening in your data, but I think that the 'real' result that reflects best your data will be the unscaled one. If the cell-line effect is strongest in this, then it is the cell-line effect that is strongest in your data.

You can investigate further via PCAtools::plotloadings(), which will show you then genes responsible for the segregation on each PC.

1
Entering edit mode

Hello Kevin,

I apologize for my delayed response. During this weekend I reproduced the source code of plotPCA and performed the analysis with scaling and not scaling the data. I disabled the ntop = 500 by running ntop = nrow(rld)` to capture the variance of all genes. After investigating, despite the cell lines belong to CRC some of them are from a particular histological type and perhaps this variable is responsible for the observed result.

Best regards