Question

Different results from MDS and PCA plots for DE analysis

0

Entering edit mode

3.3 years ago

benjytan1488 • 0

Hi all,

I have some difficulty understanding my results from DE analysis. Briefly, I have 2 groups (Dneg & Dpos) with 2 replicates each (total 4 samples). I performed RNA-seq and then used kallisto for transcript quantification. I imported the data using tximport with the scaledTPM option to get gene-level counts.

For DE, I used edgeR to perform the analysis. Upon exploration of the data using MDS plot, I found that the 2 replicates in Dpos are very dissimilar to each other. However, when I plot a PCA plot, I got the inverse which shows that my 2 Dneg replicates are dissimilar. I was afraid I did some mistake so I did DE analysis using DESeq2 and I got the same results.

From the looks of the plots, I'm sure 1 of my samples is an outlier or problematic, but I don't know whether it is the one from the Dneg group or the one from the Dpos group. The code I used for my analysis is as below:

(1) Read in counts data

load("~/txi.kallisto.Rdata")
head(txi.kallisto$counts, n = 5)

                     Dneg_1      Dneg_9      Dpos_10      Dpos_2 
ENSG00000000003   497.45843   579.17559     563.3364   571.37096
ENSG00000000005     0.00000     0.00000       0.0000     0.00000
ENSG00000000419  1451.35256  1764.77228    1503.6384  1427.12914
ENSG00000000457    69.25668    66.55058      74.7108    78.89496
ENSG00000000460   244.89625   263.26950     182.9352   200.74267

(2) edgeR analysis

# Prepare DGEList object
cts <- txi.kallisto$counts
group <- c("Dneg", "Dneg", "Dpos", "Dpos")
df.edgeR <- DGEList(cts, group = group)
symbol <- mapIds(org.Hs.eg.db, keys = rownames(df.edgeR), keytype = "ENSEMBL", column = "SYMBOL")
df.edgeR$genes <- data.frame(geneSYM = symbol)

# Filter out lowly expressed genes
keep <- filterByExpr(df.edgeR, group = group)
df.edgeR <- df.edgeR[keep, , keep.lib.sizes=FALSE]

# Normalization and estimate dispersions
df.edgeR <- calcNormFactors(df.edgeR)

# Plot MDS
groupPlot <- factor(c("Dneg", "Dneg", "Dpos", "Dpos"))
points <- c(15,16)
colors <- c("blue", "red")
plotMDS(df.edgeR, col = colors[groupPlot], pch = points[groupPlot])
legend("bottomright", legend = levels(groupPlot), pch = points, col = colors)

# Plot PCA
edgeR.DDS <- DESeqDataSetFromMatrix(countData = round(df.edgeR$counts), colData = df.edgeR$samples, design = ~group)
transform.edgeR.DDS <- rlog(edgeR.DDS, blind = TRUE)
pcaData <- plotPCA(transform.edgeR.DDS, intgroup = c("group"), returnData = TRUE, ntop = 1000)  
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
  geom_point(size = 5) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
  coord_fixed() +
  scale_color_manual(values = c("blue", "red"))

(3) DESeq2 analysis

# Prepare table of sample ID and experimental condition
sampleInfo <- data.frame(sampleID, group, stringsAsFactors = FALSE)

# Convert counts to DESeq2 object
ddsTxi <- DESeqDataSetFromTximport(txi.kallisto, colData = sampleInfo, design = ~group)

# Pre-filtering to remove genes with low counts
keep <- rowSums(counts(ddsTxi)) >= 10
ddsTxi <- ddsTxi[keep, ]

# Log-transform data for visualization
transform.ds <- rlog(deseq.ds, blind = TRUE)

# Plot PCA
pcaData <- plotPCA(transform.ds, intgroup = c("group"), returnData = TRUE, ntop = 1000)   
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
  geom_point(size = 5) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
  coord_fixed() +
  scale_color_manual(values = c("blue", "red"))

# Plot MDS
library(DEFormats)
deseqMDS <- as.DGEList(ddsTxi)
plotMDS(deseqMDS, col = colors[groupPlot], pch = points[groupPlot])
legend("topright", legend = levels(groupPlot), pch = points, col = colors)

My question here is: (1) Which result should I trust / use? The MDS or PCA plots? (2) Are there any better ways to check for sample similarity / difference?

Thank you very much!

deseq2 edger pca mds • 4.7k views

ADD COMMENT • link updated 3.3 years ago by Mensur Dlakic ★ 27k • written 3.3 years ago by benjytan1488 • 0

score 1 · Answer 1 · 2020-12-18

First, while both PCA and MDS are dimensionality reduction techniques, the former is linear and the latter is non-linear. PCA transforms the data so that their relative distances are preserved in the embedding, while MDS ATEMPTS to preserve relative distances.

Second, i don't think you are transforming identical data, at least from what little R code I can read. There is a rlog function in one case but not the other, and I wonder if ntop = 1000 means that only the first 1000 data points are used for PCA. That would definitely make it a different comparison, especially if the number of genes is >>1000.

The code below illustrates my point. It makes 10000 random expression values in 5-250 range, and creates its matching dataset by randomly adding +/-5 noise points. Then it does the same thing one more time, thus creating two simulated positive and negative datasets. PCA will in most cases end up producing different results for all data (left panel) vs. only first 1000 data points (middle). MDS is different from both, but that is to be expected given the differences between the two transformations.

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
mds = MDS(n_components=2)

df = pd.DataFrame(np.random.uniform(low=5, high=250, size=10000), columns=['positive_1'])
df['positive_2'] = df['positive_1'].apply(lambda x: x + np.random.uniform(-5, 5))
df['negative_1'] = np.random.uniform(low=5, high=250, size=10000)
df['negative_2'] = df['negative_1'].apply(lambda x: x + np.random.uniform(-5, 5))
df[df.columns] = StandardScaler().fit_transform(df.values)
pca_df = pd.DataFrame(pca.fit_transform(df.transpose().values))
mds_df = pd.DataFrame(mds.fit_transform(df.transpose().values))
pca_df_part = pd.DataFrame(pca.fit_transform(df[:1000].transpose().values))
fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot(131)
pca_df.plot(kind='scatter', x=0, y=1, c=['r','r','b','b'], ax=ax, title='PCA')
ax = fig.add_subplot(132)
pca_df_part.plot(kind='scatter', x=0, y=1, c=['r','r','b','b'], ax=ax, title='PCA_1000')
ax = fig.add_subplot(133)
mds_df.plot(kind='scatter', x=0, y=1, c=['r','r','b','b'], ax=ax, title='MDS')
plt.tight_layout()
plt.savefig('1st-comparison.png', dpi=150)
plt.show()

enter image description here

score 0 · Answer 2 · 2020-12-18

0

Entering edit mode

3.3 years ago

swbarnes2 14k

From the looks of the plots, I'm sure 1 of my samples is an outlier or problematic,

I do not believe that's possible to conclude based on an n of 2 for each group. Unless the people who did the benchwork can point you to a clear QC issue, it would be irresponsible to omit any sample.

Did you give DESeq TPM data, and not raw counts? Are you sure you are supposed to do that?

ADD COMMENT • link 3.3 years ago by swbarnes2 14k

0

Entering edit mode

I do not believe that's possible to conclude based on an n of 2 for each group. Unless the people who did the benchwork can point you to a clear QC issue, it would be irresponsible to omit any sample.

I understand but the sample was prepared and sequenced by a previous member about 3 years ago and he has since left the lab. It will be difficult to determine a clear QC issue.

Did you give DESeq TPM data, and not raw counts? Are you sure you are supposed to do that?

I think so as I am just following the method showed here: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#DESeq2

ADD REPLY • link 3.3 years ago by benjytan1488 • 0