Question

DESeq2 and normalization

0

Entering edit mode

4 weeks ago

JH • 0

Hello. I'm currently studying RNA-seq on my own and have a part that I don't understand. Please let me know if I've violated any rules of this forum, and I'll take appropriate action. Since I lived in a non-English speaking country, please understand that my English may not be fluent.

First, I'm studying plants (rice). I obtained RNA-sequencing datas for a total of six samples, with three biological replicates each for Wild-type and mutant plant. I used Kallisto for mapping and quantification. After quantification, I obtained abundance.h5 and abundance.tsv files for each sample.

To extract the list of Differentially Expressed Genes (DEGs), I used the DESeq2 function in R. I used the abundance.h5 file for analysis. However, I'm not clear on the concept of normalization (I understand what normalization is and why it is necessary). I've read the DESeq() manual. I understand that DESeq() requires raw data, and since it contains an internal normalization model, I don't need to perform additional normalization.. But I'm unsure if that's true.

Do I need to perform additional normalization when using DESeq2? The tsv file contains both TPM and est_counts. If using abundance.h5 data, which data type should be input into DESeq(): TPM or est_counts?

Here is the R code I have used. Please let me know if there are any mistakes. What I aim to obtain is the list of DEGs (Differentially Expressed Genes) between the wild-type (WT) and mutant samples.

##
info <- read.csv('metadata.txt', sep='\t')

files <- paste('./s1', list.files(path='./s1', pattern = 'abundance.h5', recursive = TRUE), sep='/')

names(files) <- info$Sample.Name

tx2gene <- read.csv('ref.cdna.csv')

kallisto.h5 <- tximport::tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)

mr = mr %>% mutate(Sample.Name = as.factor(Sample.Name)) 

dds <- DESeqDataSetFromTximport(kallisto.h5, info, ~Sample.Name)

dds$Sample.Name <- relevel(dds$Sample.Name, ref = 'WT')


#DESeq2
dds <- DESeq(dds)

res <- results(dds)

S1 <- results(dds, name="Sample.Name_MT_vs_WT")

S1d <- as.data.frame(S1)

S1f0.05 <- S1d[which(S1d$padj < 0.05 & abs(S1d$log2FoldChange) >= 2),]

write.csv(S1f0.05,file="deseqS1.csv")  #This is the file I need.


#Normalization #How do I utilize this file?

S1_normalized<-counts(dds, normalized=TRUE)

write.table(S1_normalized, "S1_norm.txt", sep="\t")

RNA-seq normalization DESeq2 • 377 views

ADD COMMENT • link updated 28 days ago by dsull ★ 5.8k • written 4 weeks ago by JH • 0

score 1 · Answer 1 · 2024-03-27

1

Entering edit mode

4 weeks ago

dsull ★ 5.8k

I understand that DESeq() requires raw data, and since it contains an internal normalization model, I don't need to perform additional normalization.

That is correct.

Do I need to perform additional normalization when using DESeq2?

No.

The tsv file contains both TPM and est_counts. If using abundance.h5 data, which data type should be input into DESeq(): TPM or est_counts?

Neither. You already have tximport importing all the relevant data from your abundance.h5 file (which is a binary representation of the tsv file).

I don't have time to give you feedback on your R code, but if you're getting the log2FoldChange and padj correctly outputted, that's all you need for your analysis

ADD COMMENT • link 4 weeks ago by dsull ★ 5.8k

0

Entering edit mode

Thank you for your response. I have often seen commands like dds <- estimateSizeFactors(dds) normalized_counts <- counts(dds, normalized=TRUE)

Why are these necessary?

Also, I've observed vsd_trans <- vst(dds, blind = TRUE) being used for PCA analysis and heatmap clustering.

Is additional normalization required for PCA analysis and heatmap clustering beyond the normalization model within DESeq2? I apologize for not being well-versed in RNA-seq. I'm studying it on my own, and I expect your response will be very helpful.

ADD REPLY • link 29 days ago by JH • 0

0

Entering edit mode

The DESeq command already does the necessary work that you'll need for your analysis (including normalization); it gets you your log2FoldChange and padj.

If you want to do other types of analyses beyond what the DESeq command offers -- like PCA and heatmap clustering, you'd manually extract the normalized_counts and use those. vst is a special way of doing a log2-transform (you can read up on it). You generally want to do downstream analysis on counts that have been "normalized" (so you can compare between samples) and log-transformed (otherwise your heatmap will look like crap; you can look up why we log-transform things).

Anyway, I hope that clears up everything -- do a search for additional questions :)

ADD REPLY • link 28 days ago by dsull ★ 5.8k