Question

Plotting of genes using tpm matrix

0

Entering edit mode

4 months ago

rajdeepboral00 ▴ 70

I have a tpm matix of my samples from RNA-Seq data, now I want to plot the expression of a particular gene Lets say X. Should i directly plot it using the tpm values or any kind of scaling is necessary? Also, which test should i imply to check for the statistical significance?

Thankyou in advance.

normalization TPM boxplot • 750 views

ADD COMMENT • link updated 4 months ago by i.sudbery 22k • written 4 months ago by rajdeepboral00 ▴ 70

2

Entering edit mode

To add to what @ATPoint was saying:

TPM is fine for comparing the expression of two genes within a single condition (or at least, is probably the best thing we have). I'd probably recommend using log TPM for this type of thing. Its not entirely clear what the correct distribution for log TPM is, but t-tests probably work well enough in this situation.

TPM is not good for comparing the expression of a gene across two or more conditions. For this use proper count modeling with DESeq2/edgeR/limma-voom.

ADD REPLY • link 4 months ago by i.sudbery 22k

0

Entering edit mode

# Load required libraries
library(DESeq2)
library(dplyr)
library(tibble)
library(ggplot2)
library(ggpubr)  # For stat_compare_means()

# ----------------------------------------------------------
# STEP 1: Get Ensembl ID for Hif1a
# ----------------------------------------------------------
gene_id <- norm_counts_annotated %>%
  filter(external_gene_name == "Hif1a") %>%
  pull(EnsemblID_clean) %>%
  unique()

# ----------------------------------------------------------
# STEP 2: Extract normalized counts
# ----------------------------------------------------------
norm_counts <- counts(dds_all, normalized = TRUE)

# ----------------------------------------------------------
# STEP 3: Build dataframe
# ----------------------------------------------------------
df <- data.frame(
  Sample = colnames(norm_counts),
  Expression = as.numeric(norm_counts[gene_id, ]),
  Condition = colData(dds_all)$Condition,
  stringsAsFactors = FALSE
)

# Add region information based on sample name
df$RegionCode <- sub("\\d+", "", df$Sample)

region_map <- c("CC" = "Cervical", "CT" = "Thoracic", "CL" = "Lumbar",
                "IC" = "Cervical", "IT" = "Thoracic", "IL" = "Lumbar")
condition_map <- c("CC" = "Control", "CT" = "Control", "CL" = "Control",
                   "IC" = "Infected", "IT" = "Infected", "IL" = "Infected")

df$Region <- region_map[df$RegionCode]
df$Condition <- condition_map[df$RegionCode]

# Handle zero values for log scale
df$Expression[df$Expression == 0] <- NA

# ----------------------------------------------------------
# STEP 4: Plot with boxplots + p-values per region
# ----------------------------------------------------------
ggplot(df, aes(x = Condition, y = Expression, fill = Condition)) +
  geom_boxplot(outlier.shape = NA, width = 0.5, color = "black") +
  geom_jitter(width = 0.15, size = 2.5, alpha = 0.8) +
  facet_wrap(~ Region, scales = "free_y") +
  scale_y_continuous(trans = "log10") +
  stat_compare_means(
    method = "wilcox.test",
    label = "p.signif",
    comparisons = list(c("Control", "Infected")),
    label.y.npc = "top",
    size = 5
  ) +
  theme_classic() +
  labs(
    title = "Hif1a Expression by Region (Control vs Infected)",
    x = "Condition",
    y = "Normalized Count (log10)"
  ) +
  scale_fill_manual(values = c("Control" = "steelblue", "Infected" = "firebrick")) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    axis.text.x = element_text(angle = 0),
    axis.title = element_text(size = 12)
  )

I just wanted to know if this is a correct way to plot the expression of my particular gene of interest. I have run deseq2 and took the normalized counts of the deseq2 itself to plot for my gene of interest. Moreover Is the y-axis appropriate?

enter image description here

ADD REPLY • link updated 4 months ago by i.sudbery 22k • written 4 months ago by rajdeepboral00 ▴ 70

1

Entering edit mode

Yes, the y-axis is reasonable. ALthough the test for differences should come from DESeq, rather than from a wilcox in this case. With four replicates in each condition, I'd say its more or less impossible for there to be a significant diffrence via a wilcox test.

ADD REPLY • link 4 months ago by i.sudbery 22k

1

Entering edit mode

Scaling is only necessary if you have genes with different expression levels, so you probably can go ahead. For differential expression of RNA-seq there are packages such as limma, edgeR and DESeq2. Starting from TPM is not good, asked many times before, please search for it.

ADD REPLY • link 4 months ago by ATpoint 90k