Question: Scale and Center [normalized] RNA-seq expression counts for PCA ?
1
gravatar for gaelgarcia
5.3 years ago by
gaelgarcia170
UK
gaelgarcia170 wrote:

I have a dataset of hundreds of different samples and their normalized (by library size) feature counts.

I want to perform downstream analysis on these data, starting with PCA using `prcomp`. Should I center and scale the values before PCA, or is the normalization of reads enough?

Thanks!

sequencing rna-seq pca R genome • 7.8k views
ADD COMMENTlink modified 15 days ago by majeedmj.ict10 • written 5.3 years ago by gaelgarcia170
1

You can use log transformed values and also add pseudocounts  to reduce the bias towards highly expressed transcripts. Always higher values (expression) dominates the variation levels between the samples than the lower values (or less expressed transcripts). If you do not use pseudocount the results will be completely based on highly expressed transcripts. This plays a major role when you do analysis on less expressed transcripts especially the non-coding RNAs along with protein coding ones.

ADD REPLYlink written 5.3 years ago by EagleEye6.6k

Thanks. I wasn't referring to log-transforming, but to "scaling" and "centering" which are standarization options when performing PCA. Since I am not visualizing the expression counts, but rather the samples and their coordinates in PCA space, I don't think it makes any difference if I log-transform the data in this case.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by gaelgarcia170

I could not able to explain you properly but you will get better explanation here : 

https://www.researchgate.net/post/What_is_the_best_way_to_scale_parameters_before_running_a_Principal_Component_Analysis_PCA 

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by EagleEye6.6k

Helloo Kevin,

I really like your way of explaining and clarifying concepts. So i thought to ask you only about my issue, can you please respond me in your busy schedule Thanks .

My question is : I have 57 tumor samples, and 3 healthy samples and 3 carrier samples (as condition in Deseq colData) . I have created VSD by using vst function on DEseq object and tried to find sample mixup by using PCA .

After running PlotPCA and Pheatmap , i got PCA plot and heatmap (attached), and it showed mixup of tumor and healthy samples.

I was wondering How can i correct it m so that samples visible seperated (clustered) on PCA plot ? Thanks.

PCA plot

Heatmap

ADD REPLYlink written 15 days ago by majeedmj.ict10

Hello kevin, here are link for PCA and heamap plots:

PCA- https://ibb.co/YhXKnL6 Heatmap - https://ibb.co/86N4kbb

ADD REPLYlink written 15 days ago by majeedmj.ict10

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLYlink written 15 days ago by genomax84k

I apolozise Kevin, as i'm first time user on this Biostar, i didn't know about it.

ADD REPLYlink written 14 days ago by majeedmj.ict10
0
gravatar for Kevin Blighe
21 months ago by
Kevin Blighe60k
Kevin Blighe60k wrote:

That option of prcomp just helps to 'iron out' the 'bumps' in the data. If you don't perform it, you may very well observe PC1 representing more variance in the dataset than it normally would. This is why we would typically perform PCA on logged counts in the first place, i.e., because the distributon of logged data is more 'smooth' than that of unlogged.

I have written more about PCA here:

Kevin

ADD COMMENTlink written 21 months ago by Kevin Blighe60k

Hello Kevin, I took logtransformed count data and performed PCA by providing Condition ATL, Healthy and carrier. Still PCA showed mixup of Healthy with ATL . Whether i need to dicard Healthy sample ? ,,, I dont want to discard healthy samples as i have only 3 normal samples and I need to do Differential expression using DESEq with contrast ATL vs Healthy . Can you suggest me how can i proceed , a clear idea .. Thank you !

ADD REPLYlink written 14 days ago by majeedmj.ict10

Hi, it is not necessarily a bad thing for your sample groups to mix on a PCA bi-plot. In fact, it may very well be a positive sign that indicates your dataset is ready for further analysis. Unfortunately, I cannot make any definitive comments on your data without seeing relevant code, the actual bi-plot (with explained variation for each PC being plot), and, generally, understanding more about your project.

ADD REPLYlink written 14 days ago by Kevin Blighe60k

Hi again. I now see your other comment (above). I would remove the 2 samples at the right, and then re-process the data.

Your dataset is heavily imbalanced, by the way, and this will be reflected by 'crazy' p-values when you proceed to the differential expression analysis - wait and see.

ADD REPLYlink modified 14 days ago • written 14 days ago by Kevin Blighe60k

Dear Kevin,

Please find below R code I have 57 ATL samples, 3 carrier and 3 Normal ( Do i need to provide Sample Data matrix , just worrying about data security , so not sending).... I need to to Differential expression between ATL vs Normal Please find the attached biplot and scatterplot3d for reference . Green color is ATL, Red is carrier and yellow is Normal. Can you please help me on following issues: 1) In PCA Normal samples mixed up in ATL , Whether can i proceed for Differential expression using these (even they mixup) DESEq and Enhanced plot for visualization ? or is there any serious issue ? 2) If there is issue m can you help me in soving this issue. 3) Whether i need to remove 2 ATL samples at right ? 4) How to Normalize data , So i can use that data for differential expression ? 5) You said my data is heavily imbalanced, how to should i balance it. 6) In Enhanced plot, what is the value i need to put at pCutoff = , so i set 0.05 ?

I really sorry for so many questions, i trust your responsem so asked ... Thanks Below im sending R code and plots

ADD REPLYlink written 14 days ago by majeedmj.ict10

------------- R- code -------------------------------------

-------------Read count file and column data file ------------

countdata= read.csv("HTLV_RNAseq_count_data.csv", header = T)

library(dplyr) countdata_final= distinct(countdata, Ensemble.ID, .keep_all = TRUE)

countdata_final_log= log2(countdata_final+1)

write.csv(countdata_final, "HTLV_countdata.csv")

write.csv(countdata_final_log, "HTLV_countdata_log.csv")

rownames(countdata_final)= countdata_final$Ensemble.ID

abline(h=median(as.matrix(logcount_data)), col="blue")

columndata=read.csv("column_data.csv", header = T)

Setting up the DESeqDataSet Object and run the DESeq pipeline

library(DESeq2) HTLV_dds_object = DESeqDataSetFromMatrix(countData=countdata_final, colData=columndata, design=~condition)

HTLV_dds_object$condition= relevel(HTLV_dds_object$condition, "Healthy_Volunteer")

HTLV_dds = DESeq(HTLV_dds_object)

--------------------PCA Analysis -------------------------------------

counts.sf_normalized = counts(HTLV_dds , normalized = TRUE)

colnames(counts.sf_normalized) = c(rep("ATL", 57), rep("Carrier", 3), rep("healthy", 3))

type <- list( rownames(t(counts.sf_normalized)))

type <- gsub("[0-9]*$", "", rownames(t(counts.sf_normalized))) type

type <- factor(type, levels = c("ATL","Carrier","healthy")) type

colType <- c("forestgreen", "red2", "yellow")[type] colType

library(PCAtools)

project.pca <- prcomp(t(counts.sf_normalized)) summary(project.pca)

project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))

par(cex=1.0, cex.axis=0.8, cex.main=0.8) pairs(project.pca$x[,1:5], col=colType, main="Principal components analysis bi-plot\nPCs 1-5", pch=16) pairs(project.pca$x[,6:10], col=colType, main="Principal components analysis bi-plot\nPCs 6-10", pch=16)

plot(project.pca$x, type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%")) points(project.pca$x, col=colType, pch=16, cex=1)

require(scatterplot3d) par(mar=c(4,4,4,4), cex=1.0, cex.main=0.8, cex.axis=0.8)

scatterplot3d(project.pca$x[,1:3], angle=-40, main="", color=colType, pch=17, xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), zlab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"), grid=FALSE, box=FALSE)

<h6>another PCA way</h6>

VSD = vst(HTLV_dds, blind = FALSE)

library(ggrepel)

VSD2= log(VSD)

plotPCA(VSD2, "condition")

<h6>another PCA way</h6>

-----------PCA ----------------------------

resultsNames(HTLV_dds)

ATL_Normal_res= results(HTLV_dds, contrast = c("condition", "ATL", "Healthy_Volunteer"))

-----------------Now lets considered ATL vs Normal ---------------------------

ATL_Normal_res= results(dds, contrast = c("condition", "ATL", "Healthy_Volunteer"))

ix = which.min(ATL_Normal_res$padj) # most significant

ATL_Normal_res <-ATL_Normal_res[order(ATL_Normal_res$padj),] # sort library(knitr) kable(ATL_Normal_res[1:5,-(3:4)])

barplot(assay(HTLV_dds_object)[ix,],las=2, main=rownames(HTLV_dds_object)[ ix ] )

Annotatae now with Annotationdbi

library("AnnotationDbi") library("org.Hs.eg.db") columns(org.Hs.eg.db)

ATL_Normal_res$symbol = mapIds(org.Hs.eg.db, keys=row.names(ATL_Normal_res), column="SYMBOL", keytype="ENSEMBL", multiVals="first") ATL_Normal_res$entrez = mapIds(org.Hs.eg.db, keys=row.names(ATL_Normal_res), column="ENTREZID", keytype="ENSEMBL", multiVals="first") ATL_Normal_res$name = mapIds(org.Hs.eg.db, keys=row.names(ATL_Normal_res), column="GENENAME", keytype="ENSEMBL", multiVals="first")

library(EnhancedVolcano)

write.csv(ATL_Normal_res, file = "ATL_Normal_result_DEG.csv")

ATL_Normal_res_DEG= as.data.frame(ATL_Normal_res)

ATL_Normal_res_DEG_final= distinct(ATL_Normal_res_DEG, symbol, .keep_all = TRUE)

ATL_Normal_res_DEG_final=ATL_Normal_res_DEG_final %>% drop_na(symbol)

rownames(ATL_Normal_res_DEG_final)= ATL_Normal_res_DEG_final$symbol

EnhancedVolcano(ATL_Normal_res_DEG_final, lab = rownames(ATL_Normal_res_DEG_final), x = 'log2FoldChange', y = 'pvalue', pCutoff = 0.05, FCcutoff = 2)

ADD REPLYlink written 14 days ago by majeedmj.ict10

Below PCA plots and link PCA1to5 PCA6-10 Scatter3d Screeplot

ADD REPLYlink written 14 days ago by majeedmj.ict10

https://ibb.co/8N2NYTn https://ibb.co/1TD54NL https://ibb.co/M2SXBhq https://ibb.co/v41ZjK8

ADD REPLYlink written 14 days ago by majeedmj.ict10

Hi! Just some points:

[a]

When you load PCAtools via library(PCAtools), you are then not using the functions from the package; instead, you are using prcomp() and base R plotting functions - you are probably following my other Biostars post? The actual PCAtools functions are elaborated here: http://www.bioconductor.org/packages/release/bioc/vignettes/PCAtools/inst/doc/PCAtools.html

Also, for input to all PCA functions, I would use assay(VSD), not counts.sf_normalized

[b]

in your 'another PCA way' section, you do not need to log VSD via:

VSD2= log(VSD)

You should just use assay(VSD) without any further transformations.

[c]

When generating a results table, you should be using log fold-change shrinkage (lfcShrink()). Take a look here: Quick start.

In your case, there should be a coefficient called 'condition_ATL_vs_Healthy_Volunteer', which you can use.

[d]

There is still evidence of an outlier in your PCA bi-plots along PC1. I would remove that sample.

[e]

Always check and double-check that your input objects are perfectly aligned in name and order, i.e.,:

  • countdata_final
  • columndata

Also, this should return TRUE:

all(rownames(columndata) == colnames(countdata_final))
ADD REPLYlink modified 14 days ago • written 14 days ago by Kevin Blighe60k

Dear Kevin, Thank you for your time !

I tried PCAtools but getting error after running screeplot or biplot on p. I also tried your documentation example and same error i am getting, below is code and error.

library(PCAtools)


resultsNames(HTLV_dds)

HTLV_dds
columndata
countdata_final
all(rownames(columndata) == colnames(countdata_final))
HTLV_VSD = vst(HTLV_dds, blind = FALSE)

count_vsd = assay(HTLV_VSD)

all(rownames(columndata) == colnames(count_vsd))

p <- pca(count_vsd, metadata=columndata, removeVar = 0.1)

screeplot(p)

biplot(p)
#---------Error ----------------------------------------------

> biplot(p)
Error: $ operator not defined for this S4 class

#---------Error ----------------------------------------------

I can remove ATL samples (1 or 2 ) but dont want to remove Normal samples (as only 3 samples) to do DIfferential expression.

Can you suggest me , can i proceed to do DESEq using same data(just by removing right side 2 outlier samples) ?

ADD REPLYlink modified 13 days ago by Kevin Blighe60k • written 13 days ago by majeedmj.ict10

Regarding PCAtools, there could be a namespace issue. You could try:

p <- PCAtools::pca(count_vsd, metadata=columndata, removeVar = 0.1)

PCAtools::screeplot(p)

PCAtools::biplot(p)

Can you suggest me , can i proceed to do DESEq using same data(just by removing right side 2 outlier samples) ?

I suggest that you re-generate the PCA bi-plot to see if there still exists an outlier (or outliers).

ADD REPLYlink written 13 days ago by Kevin Blighe60k

Dear Kevin, As per your suggestion, i performed PCAtools , please find the attached biplot and other plots and links :

---------------------Code -------------------------

library(PCAtools) columndata_pca= read.csv("column_data_forPCA.csv", header = T)

rownames(columndata_pca) = columndata_pca$ID

countdata_PCA= read.csv("HTLV_countdata_for_PCA.csv", header = T)

rownames(countdata_PCA)= countdata_PCA$ID

library(DESeq2) HTLV_dds_object = DESeqDataSetFromMatrix(countData=countdata_PCA, colData=columndata_pca, design=~condition)

HTLV_dds_object$condition= relevel(HTLV_dds_object$condition, "Healthy_Volunteer")

HTLV_dds = DESeq(HTLV_dds_object)

resultsNames(HTLV_dds)

HTLV_dds countdata_PCA columndata_pca

HTLV_VSD = vst(HTLV_dds, blind = FALSE)

count_vsd = assay(HTLV_VSD)

all(rownames(columndata_pca) == colnames(count_vsd))

p <- PCAtools::pca(count_vsd, metadata=columndata_pca, removeVar = 0.1)

PCAtools::screeplot(p) PCAtools::biplot(p) PCAtools::screeplot(p) # screeplot(p)

PCAtools::pairsplot(p)

---------------------Code -------------------------

https://ibb.co/bHWWk7B https://ibb.co/SyBmnVg https://ibb.co/5KxfjnL

pairplot

biplot screeplot

It will be great help , if you let me know what should i do next, to do Deseq Differential expression (without removing Healthy samples) . plz . Thanks !

ADD REPLYlink written 13 days ago by majeedmj.ict10

Hi, there is still clear evidence of 2 outlier samples:

  • ATL21
  • ATL48

You should remove these from the raw data stage, and then re-process the data.

ADD REPLYlink written 13 days ago by Kevin Blighe60k

Dear Kevin, I really appreciate your suggestions and time ... Thank you :)

I removed 2 outliers ... ATL21, and ATL48 and reploted Biplots again ( please find below)

1) Can you tell me , Whether I need to remove any samples? ... I don't want to remove (Healthy samples) . 2) Is there any way , That without removing any samples , can i proceed to do Differential expression using Deseq. 3) How can i plot 3D plot of PCA .

screeplot PCA1to5 biplot Heatmap

Heatmap - https://ibb.co/d4c0YhS .... Biplot - https://ibb.co/ydPY9BJ .... Pairsplot - https://ibb.co/HnmrN5g ... Screenplot - https://ibb.co/jDRmX8r ...

ADD REPLYlink written 13 days ago by majeedmj.ict10

Data looks good now.

ADD REPLYlink written 13 days ago by Kevin Blighe60k
1

Kevin, You are Amazing :) .

Then i will proceed for DIfferential expression analysis .

Thank you for your time and help :) !

ADD REPLYlink written 12 days ago by majeedmj.ict10

Sure thing - hope that it all works out!

ADD REPLYlink written 12 days ago by Kevin Blighe60k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1823 users visited in the last hour