Different results from MDS and PCA plots for DE analysis
2
0
Entering edit mode
3.3 years ago

Hi all,

I have some difficulty understanding my results from DE analysis. Briefly, I have 2 groups (Dneg & Dpos) with 2 replicates each (total 4 samples). I performed RNA-seq and then used kallisto for transcript quantification. I imported the data using tximport with the scaledTPM option to get gene-level counts.

For DE, I used edgeR to perform the analysis. Upon exploration of the data using MDS plot, I found that the 2 replicates in Dpos are very dissimilar to each other. However, when I plot a PCA plot, I got the inverse which shows that my 2 Dneg replicates are dissimilar. I was afraid I did some mistake so I did DE analysis using DESeq2 and I got the same results.

summary

From the looks of the plots, I'm sure 1 of my samples is an outlier or problematic, but I don't know whether it is the one from the Dneg group or the one from the Dpos group. The code I used for my analysis is as below:

(1) Read in counts data

load("~/txi.kallisto.Rdata")
head(txi.kallisto$counts, n = 5)

                     Dneg_1      Dneg_9      Dpos_10      Dpos_2 
ENSG00000000003   497.45843   579.17559     563.3364   571.37096
ENSG00000000005     0.00000     0.00000       0.0000     0.00000
ENSG00000000419  1451.35256  1764.77228    1503.6384  1427.12914
ENSG00000000457    69.25668    66.55058      74.7108    78.89496
ENSG00000000460   244.89625   263.26950     182.9352   200.74267

(2) edgeR analysis

# Prepare DGEList object
cts <- txi.kallisto$counts
group <- c("Dneg", "Dneg", "Dpos", "Dpos")
df.edgeR <- DGEList(cts, group = group)
symbol <- mapIds(org.Hs.eg.db, keys = rownames(df.edgeR), keytype = "ENSEMBL", column = "SYMBOL")
df.edgeR$genes <- data.frame(geneSYM = symbol)

# Filter out lowly expressed genes
keep <- filterByExpr(df.edgeR, group = group)
df.edgeR <- df.edgeR[keep, , keep.lib.sizes=FALSE]

# Normalization and estimate dispersions
df.edgeR <- calcNormFactors(df.edgeR)

# Plot MDS
groupPlot <- factor(c("Dneg", "Dneg", "Dpos", "Dpos"))
points <- c(15,16)
colors <- c("blue", "red")
plotMDS(df.edgeR, col = colors[groupPlot], pch = points[groupPlot])
legend("bottomright", legend = levels(groupPlot), pch = points, col = colors)

# Plot PCA
edgeR.DDS <- DESeqDataSetFromMatrix(countData = round(df.edgeR$counts), colData = df.edgeR$samples, design = ~group)
transform.edgeR.DDS <- rlog(edgeR.DDS, blind = TRUE)
pcaData <- plotPCA(transform.edgeR.DDS, intgroup = c("group"), returnData = TRUE, ntop = 1000)  
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
  geom_point(size = 5) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
  coord_fixed() +
  scale_color_manual(values = c("blue", "red"))

(3) DESeq2 analysis

# Prepare table of sample ID and experimental condition
sampleInfo <- data.frame(sampleID, group, stringsAsFactors = FALSE)

# Convert counts to DESeq2 object
ddsTxi <- DESeqDataSetFromTximport(txi.kallisto, colData = sampleInfo, design = ~group)

# Pre-filtering to remove genes with low counts
keep <- rowSums(counts(ddsTxi)) >= 10
ddsTxi <- ddsTxi[keep, ]

# Log-transform data for visualization
transform.ds <- rlog(deseq.ds, blind = TRUE)

# Plot PCA
pcaData <- plotPCA(transform.ds, intgroup = c("group"), returnData = TRUE, ntop = 1000)   
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
  geom_point(size = 5) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
  coord_fixed() +
  scale_color_manual(values = c("blue", "red"))

# Plot MDS
library(DEFormats)
deseqMDS <- as.DGEList(ddsTxi)
plotMDS(deseqMDS, col = colors[groupPlot], pch = points[groupPlot])
legend("topright", legend = levels(groupPlot), pch = points, col = colors)

My question here is: (1) Which result should I trust / use? The MDS or PCA plots? (2) Are there any better ways to check for sample similarity / difference?

Thank you very much!

deseq2 edger pca mds • 4.7k views
ADD COMMENT
1
Entering edit mode
3.3 years ago
Mensur Dlakic ★ 27k

First, while both PCA and MDS are dimensionality reduction techniques, the former is linear and the latter is non-linear. PCA transforms the data so that their relative distances are preserved in the embedding, while MDS ATEMPTS to preserve relative distances.

Second, i don't think you are transforming identical data, at least from what little R code I can read. There is a rlog function in one case but not the other, and I wonder if ntop = 1000 means that only the first 1000 data points are used for PCA. That would definitely make it a different comparison, especially if the number of genes is >>1000.

The code below illustrates my point. It makes 10000 random expression values in 5-250 range, and creates its matching dataset by randomly adding +/-5 noise points. Then it does the same thing one more time, thus creating two simulated positive and negative datasets. PCA will in most cases end up producing different results for all data (left panel) vs. only first 1000 data points (middle). MDS is different from both, but that is to be expected given the differences between the two transformations.

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
mds = MDS(n_components=2)

df = pd.DataFrame(np.random.uniform(low=5, high=250, size=10000), columns=['positive_1'])
df['positive_2'] = df['positive_1'].apply(lambda x: x + np.random.uniform(-5, 5))
df['negative_1'] = np.random.uniform(low=5, high=250, size=10000)
df['negative_2'] = df['negative_1'].apply(lambda x: x + np.random.uniform(-5, 5))
df[df.columns] = StandardScaler().fit_transform(df.values)
pca_df = pd.DataFrame(pca.fit_transform(df.transpose().values))
mds_df = pd.DataFrame(mds.fit_transform(df.transpose().values))
pca_df_part = pd.DataFrame(pca.fit_transform(df[:1000].transpose().values))
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(131)
pca_df.plot(kind='scatter', x=0, y=1, c=['r','r','b','b'], ax=ax, title='PCA')
ax = fig.add_subplot(132)
pca_df_part.plot(kind='scatter', x=0, y=1, c=['r','r','b','b'], ax=ax, title='PCA_1000')
ax = fig.add_subplot(133)
mds_df.plot(kind='scatter', x=0, y=1, c=['r','r','b','b'], ax=ax, title='MDS')
plt.tight_layout()
plt.savefig('1st-comparison.png', dpi=150)
plt.show()

enter image description here

ADD COMMENT
0
Entering edit mode

Second, i don't think you are transforming identical data, at least from what little R code I can read. There is a rlog function in one case but not the other, and I wonder if ntop = 1000 means that only the first 1000 data points are used for PCA. That would definitely make it a different comparison, especially if the number of genes is >>1000.

No, I did rlog transformation for both data before plotting the PCA. For edgeR analysis, the rlog function is under the # Plot PCA comment.

ntop=100 is an argument for DESeq2 plotPCA function where it decides the number of top genes to use for principal components selected by highest row variance. I repeated and specified to use all available genes and still get the same results.

ADD REPLY
0
Entering edit mode
3.3 years ago

From the looks of the plots, I'm sure 1 of my samples is an outlier or problematic,

I do not believe that's possible to conclude based on an n of 2 for each group. Unless the people who did the benchwork can point you to a clear QC issue, it would be irresponsible to omit any sample.

Did you give DESeq TPM data, and not raw counts? Are you sure you are supposed to do that?

ADD COMMENT
0
Entering edit mode

I do not believe that's possible to conclude based on an n of 2 for each group. Unless the people who did the benchwork can point you to a clear QC issue, it would be irresponsible to omit any sample.

I understand but the sample was prepared and sequenced by a previous member about 3 years ago and he has since left the lab. It will be difficult to determine a clear QC issue.

Did you give DESeq TPM data, and not raw counts? Are you sure you are supposed to do that?

I think so as I am just following the method showed here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#DESeq2

ADD REPLY

Login before adding your answer.

Traffic: 2370 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6