DESeq2 and normalization
1
0
Entering edit mode
4 weeks ago
JH • 0

Hello. I'm currently studying RNA-seq on my own and have a part that I don't understand. Please let me know if I've violated any rules of this forum, and I'll take appropriate action. Since I lived in a non-English speaking country, please understand that my English may not be fluent.

First, I'm studying plants (rice). I obtained RNA-sequencing datas for a total of six samples, with three biological replicates each for Wild-type and mutant plant. I used Kallisto for mapping and quantification. After quantification, I obtained abundance.h5 and abundance.tsv files for each sample.

To extract the list of Differentially Expressed Genes (DEGs), I used the DESeq2 function in R. I used the abundance.h5 file for analysis. However, I'm not clear on the concept of normalization (I understand what normalization is and why it is necessary). I've read the DESeq() manual. I understand that DESeq() requires raw data, and since it contains an internal normalization model, I don't need to perform additional normalization.. But I'm unsure if that's true.

Do I need to perform additional normalization when using DESeq2? The tsv file contains both TPM and est_counts. If using abundance.h5 data, which data type should be input into DESeq(): TPM or est_counts?

Here is the R code I have used. Please let me know if there are any mistakes. What I aim to obtain is the list of DEGs (Differentially Expressed Genes) between the wild-type (WT) and mutant samples.

##
info <- read.csv('metadata.txt', sep='\t')

files <- paste('./s1', list.files(path='./s1', pattern = 'abundance.h5', recursive = TRUE), sep='/')

names(files) <- info$Sample.Name

tx2gene <- read.csv('ref.cdna.csv')

kallisto.h5 <- tximport::tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)

mr = mr %>% mutate(Sample.Name = as.factor(Sample.Name)) 

dds <- DESeqDataSetFromTximport(kallisto.h5, info, ~Sample.Name)

dds$Sample.Name <- relevel(dds$Sample.Name, ref = 'WT')


#DESeq2
dds <- DESeq(dds)

res <- results(dds)

S1 <- results(dds, name="Sample.Name_MT_vs_WT")

S1d <- as.data.frame(S1)

S1f0.05 <- S1d[which(S1d$padj < 0.05 & abs(S1d$log2FoldChange) >= 2),]

write.csv(S1f0.05,file="deseqS1.csv")  #This is the file I need.


#Normalization #How do I utilize this file?

S1_normalized<-counts(dds, normalized=TRUE)

write.table(S1_normalized, "S1_norm.txt", sep="\t")
RNA-seq normalization DESeq2 • 377 views
ADD COMMENT
1
Entering edit mode
4 weeks ago
dsull ★ 5.8k

I understand that DESeq() requires raw data, and since it contains an internal normalization model, I don't need to perform additional normalization.

That is correct.

Do I need to perform additional normalization when using DESeq2?

No.

The tsv file contains both TPM and est_counts. If using abundance.h5 data, which data type should be input into DESeq(): TPM or est_counts?

Neither. You already have tximport importing all the relevant data from your abundance.h5 file (which is a binary representation of the tsv file).

I don't have time to give you feedback on your R code, but if you're getting the log2FoldChange and padj correctly outputted, that's all you need for your analysis

ADD COMMENT
0
Entering edit mode

Thank you for your response. I have often seen commands like dds <- estimateSizeFactors(dds) normalized_counts <- counts(dds, normalized=TRUE)

Why are these necessary?

Also, I've observed vsd_trans <- vst(dds, blind = TRUE) being used for PCA analysis and heatmap clustering.

Is additional normalization required for PCA analysis and heatmap clustering beyond the normalization model within DESeq2? I apologize for not being well-versed in RNA-seq. I'm studying it on my own, and I expect your response will be very helpful.

ADD REPLY
0
Entering edit mode

The DESeq command already does the necessary work that you'll need for your analysis (including normalization); it gets you your log2FoldChange and padj.

If you want to do other types of analyses beyond what the DESeq command offers -- like PCA and heatmap clustering, you'd manually extract the normalized_counts and use those. vst is a special way of doing a log2-transform (you can read up on it). You generally want to do downstream analysis on counts that have been "normalized" (so you can compare between samples) and log-transformed (otherwise your heatmap will look like crap; you can look up why we log-transform things).

Anyway, I hope that clears up everything -- do a search for additional questions :)

ADD REPLY

Login before adding your answer.

Traffic: 1607 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6