WGCNA input genes - greatest variance?
0
0
Entering edit mode
3 months ago
fluentin44 • 0

Hi all,

I'm doing WGCNA with RNASeq expression data - ive seen examples of supplying the (vst normalised) count matrix to WGCNA in its totality and some instance of those genes with the greatest variance, i.e. the top 25% or top 5000 genes. When I do the former the scale free model plot looks normal with a clear soft power of 8. When I do the latter the graphs look bizarre as below. Is this as a result of picking 5000 genes in the dataset with greatest variance?

# Removal of missing genes / outlier samples ------------------------------

gsg <- goodSamplesGenes(t(datExpr)) # filters samples and genes with too many missing entries
summary(gsg)
gsg$allOK # Indicates whether there are genes or samples which are outliers in boolean

table(gsg$goodGenes)
table(gsg$goodSamples)

# remove genes that are detected as outliers
datExpr <- datExpr[gsg$goodGenes == TRUE,]

# if (!gsg$allOK)
# {
#   # Optionally, print the gene and sample names that were removed:
#   if (sum(!gsg$goodGenes)>0) 
#     printFlush(paste("Removing genes:", paste(names(datExpr)[!gsg$goodGenes], collapse = ", ")));
#   if (sum(!gsg$goodSamples)>0) 
#     printFlush(paste("Removing samples:", paste(rownames(datExpr)[!gsg$goodSamples], collapse = ", ")));
#   # Remove the offending genes and samples from the data:
#   datExpr = datExpr[gsg$goodSamples, gsg$goodGenes]
# }



# Explicitly clustering and removing samples ------------------------------

sampleTree = hclust(dist(t(datExpr)), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, 
     cex.axis = 1.5, cex.main = 2)


samples.to.be.excluded <- c('BA5_02')
datExpr <- datExpr[,!(colnames(datExpr) %in% samples.to.be.excluded)]

sampleTree = hclust(dist(t(datExpr)), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, 
     main = "Sample clustering to detect outliers sample removed", 
     sub="", xlab="", 
     cex.lab = 1.5, 
     cex.axis = 1.5, 
     cex.main = 2)



# Make deseq2 object and exclude low count --------------------------------

# # Plot a line to show the cut
# abline(h = 200000, col = "red");
# # Determine cluster under the line
# clust = cutreeStatic(sampleTree, cutHeight = 200000, minSize = 10)
# table(clust)
# # clust 1 contains the samples we want to keep.
# keepSamples = (clust==1)
# datExpr = data[keepSamples, ]
# nGenes = ncol(datExpr)
# nSamples = nrow(datExpr)


# Making deseq2 matrix and removing lower count genes ---------------------

datTraits <- 
  hamster_pheno %>% 
  filter(!row.names(.) %in% samples.to.be.excluded)

names(datTraits)

all(rownames(datTraits) %in% colnames(datExpr))
all(rownames(datTraits) == colnames(datExpr)) # Same order

# data <- datExpr 

dds <-
  DESeqDataSetFromMatrix(countData = datExpr,
                         colData = datTraits,
                         design = ~ 1) # not specifying model


## remove all genes with counts < 15 in more than 75% of samples (31*0.75=23.25)
## suggested by WGCNA on RNAseq FAQ

sample_no <- ncol(datExpr)
sample_threshold <- sample_no * 0.75
dds75 <- dds[rowSums(counts(dds) >= 15) >= sample_threshold,]
nrow(dds75) # 13284 genes

# perform variance stabilization
dds_norm <- vst(dds75)

sampleTree = hclust(dist(t(assay(dds_norm))), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, 
     main = "Sample clustering of normalised data to detect outliers sample removed", 
     sub="", xlab="", 
     cex.lab = 1.5, 
     cex.axis = 1.5, 
     cex.main = 2)


# get normalized counts - Rows samples, columns gene IDs
datExpr <-
  assay(dds_norm) %>%
  t()

# Calculate variance for each gene
gene_variance <- apply(assay(dds_norm), 1, var)

# Option 1: Select top 25% of genes by variance
top_variance_genes <- names(sort(gene_variance, decreasing = TRUE))[1:(length(gene_variance) * 0.25)]

# # Option 2: Select top 5000 genes by variance (if less than total number of genes)
top_variance_genes_5000 <- names(sort(gene_variance, decreasing = TRUE))[1:min(5000, length(gene_variance))]
#
# # Choose one of the options based on your criteria
# # For example, if using top 25%:
selected_genes <- top_variance_genes_5000

# Filter datExpr to include only selected genes
datExpr_filtered <- datExpr[selected_genes, ]

datExpr <- datExpr_filtered %>% t()

collectGarbage()


# Binarise catagorical data -----------------------------------------------

temp <- datTraits
temp_2 <- 
  temp %>% 
  binarizeCategoricalColumns(c("group", "sex"))

# datTraits_bin <- 
#   bind_cols(temp, temp_2) %>% 
#   dplyr::select(-ab)
datTraits_bin <- temp_2

sampleTree2 <- hclust(dist(datExpr), method = "average")
traitColors <- numbers2colors(datTraits_bin)

# Check samples are clustering together
plotDendroAndColors(sampleTree2,
                    traitColors,
                    groupLabels = colnames(datTraits_bin),
                    main = "Sample dendrogram and trait heatmap normalised")



# Pick soft power threshold -----------------------------------------------

# Choose a set of soft-thresholding powers
power <- c(c(1:10), seq(from = 12, to = 50, by = 2))

# Call the network topology analysis function
sft <- pickSoftThreshold(datExpr,
                         powerVector = power,
                         networkType = "signed",
                         verbose = 5)


sft.data <- sft$fitIndices

# visualization to pick power

a1 <- 
  ggplot(sft.data, aes(Power, SFT.R.sq, label = Power)) +
  geom_point() +
  geom_text(nudge_y = 0.1) +
  geom_hline(yintercept = 0.8, color = 'red') +
  labs(x = 'Power', y = 'Scale free topology model fit, signed R^2') +
  theme_classic()


a2 <- 
  ggplot(sft.data, aes(Power, mean.k., label = Power)) +
  geom_point() +
  geom_text(nudge_y = 0.1) +
  labs(x = 'Power', y = 'Mean Connectivity') +
  theme_classic()

grid.arrange(a1, a2, nrow = 2) # Selected 18 as its above threshold and low mean connectivity

enter image description here

WGCNA RNA-Seq • 489 views
ADD COMMENT
0
Entering edit mode

When I do the latter the graphs look bizarre as below. Is this as a result of picking 5000 genes in the dataset with greatest variance?

by subsetting the top 25% genes with the greatest variance you are likely selecting subset of DEG whose expression profile is affected by a strong driver of variation. If you run a PCA on the vst transformed expression data of the 5000 genes you will likely see 2 cluster of samples

ADD REPLY
0
Entering edit mode

No looked the same as previous as the DESeq2::plotpca function takes the rows with the greatest variance anyway.

I just dont really understand why I have this odd bounce from 1-3, 3-10 and back up again without meeting the 0.8 threshold.

ADD REPLY
0
Entering edit mode

Can you post the PCA plot?

ADD REPLY

Login before adding your answer.

Traffic: 2689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6