Question

DESeq2: (1) Comparing multiple groups using a for loop & (2) normalization of time series data with housekeeping gene counts

0

Entering edit mode

4.5 years ago

AnonymousEngineer • 0

Hi Everyone:

I am trying to analyze a set of RNAseq data using DESeq2 to do a GO enrichment analysis. In my dataset have 8 groups. The groups are cond_09 cond_10 cond_11 cond_12 cond_13 cond_14 cond_15 cond_16 with 3 replicatesA, B & C' for each group.

The columns in my input TSV file are (GeneID cond_09_A cond_09_B cond_09_C cond_10_A cond_10_B ......... cond_16_A cond_16_B cond_16_C).

Right now, my goal is to find out compare all other groups against cond_09 and to calculate the fold change of the genes. I believe instead of doing it pairwise I should feed all the data for all the groups at once, and then estimate what I want. Here is my script I used to do it. I wanted to use the for loop to calculate the fold changes it for the pairs I want and then to store it in a unique TSV. I am getting the following couple of errors inside the for loop.

What am I doing wrong? I am brand new to R and I am finding it so confusing compared to MATLAB.

res <- results(dds, pAdjustMethod="BH", contrast=c("sample", caseA, caseB))
Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues, : 1 and cond_10 should be levels of sample such that sample_1_vs_cond_09 and sample_cond_10_vs_cond_09 are contained in 'resultsNames(object)'

write.table(data.frame(res), + file = 'output.tsv') sep='\t',
Error: unexpected ',' in "sep='\t'," row.names=F,
Error: unexpected ',' in "row.names=F," quote=F)
Error: unexpected ')' in "quote=F)"

#!/usr/bin/env Rscript  
library(DESeq2)
library(tidyverse)

#print("ensure that Infected = B and uninfected = A to get correct results")
df <- read.table('data_scaled_normalized_counts.tsv', header=T) 

samplename_df <- data.frame(samplename = names(df)[2:ncol(df)])
samplename_df$sample <- substr(samplename_df$samplename, 1,7)
samplename_df$batch <- substr(samplename_df$samplename, 9,9)
row.names(samplename_df) <- samplename_df$samplename
samplename_df$sample <- factor(samplename_df$sample)

count_matrix <- subset(df, select = -c(GeneID) )
row.names(count_matrix) <- df$GeneID
dds <- DESeqDataSetFromMatrix(countData = count_matrix[ , order(names(count_matrix))],
                              colData = samplename_df[order(samplename_df$sample),], 
                              design = ~ sample)
dds <- DESeq(dds)
#res <- results(dds, pAdjustMethod="BH", contrast=c("sample","cond_09","cond_10"))

samplelist <- unique(samplename_df$sample)
caseA <- samplelist[1]
for (j in samplelist[2:length(samplelist)]){
  caseB <- j
  #res <- results(dds, pAdjustMethod="BH", contrast=c("sample", 'cond_09', 'cond_10'))
  res <- results(dds, pAdjustMethod="BH", contrast=c("sample", caseA, caseB))
  print(res)
  res$GeneId = row.names(res)
  write.table(data.frame(res), 
              file = paste("output", caseB, "vs", caseA, ".tsv")
              sep='\t', 
              row.names=F,
              quote=F)
  resultsNames(dds)
}

Additionally, the conditions are timepoints of data collection in a time series experiment at which the samples were collected from tissues that were increasing in terms of dry mass. To account for increasing mass of the tissues, I was advised that I normalize all the reads at all timepoints with the reads of a specific housekeeping gene, as we don't have data on dry biomass of the collected samples. To convert to integer I multiplied each by 100000 and rounded off the reads. I have read in multiple threads that the reads should not be normalized (FPKM etc). Is there any issue with normalizing to read counts of a housekeeping gene, or normalizing to unit cell mass?

Thank you in advance!

Anby

rnaseq deseq deseq2 R • 3.3k views

ADD COMMENT • link updated 4.5 years ago by ATpoint 82k • written 4.5 years ago by AnonymousEngineer • 0

0

Entering edit mode

Change caseA definition to caseA <- as.character(samplelist[1]) and add a comma after file = paste("output", caseB, "vs", caseA, ".tsv").

ADD REPLY • link 4.5 years ago by jomo018 ▴ 720

score 0 · Answer 1 · 2019-10-17

I cannot answer 1) as I am not too familiar with DESeq2. In edgeR I would make a design with the reference sample as intercept and then loop through the respective contrasts you want. DESeq2 is probably similar, maybe check the coef argument in results which is simply an integer indicating the position of the respective contrast in the results object. Might be easier to access contrasts by this than the exact name. For the normalization: I would not do any of these strategies.

=> Cell mass is probably absolutely unreliable as you have no guarantee that this accurately reflects either RNA amounts, RNA quality / integrity (=absolute usable RNA amounts) or library composition.

=> Normalization to a single gene is probably not reliable as in any high-throughput experiment you cannot / should not rely on a single value / measurement as there is no guarantee it is representative. Also it is difficult to define true housekeeping genes. Priminent candidates such as GAPDH I've often seen being differentially-expressed in my data as it basically (probably) is a reflection of the cellular activity status. Better rely on many genes as reference for normalization in a data-driven manner. See below.

=> The normalization in DESeq2 assumes that most genes do not change between conditions. For details see here. If this holds true for your data (and it does for most RNA-seq experiments unless you are massively "messing up" the cells with critical knockouts or overexpression experiments) this should be what you do. It should compensate sequencing depth and library composition. Yes, avoid FPKM, this is not suitable for anything beyond visualization. Basically all benchmarking papers on that matter show inferior behaviour of pure by-million normalization technicales such as FPKM.

=> You could check by PCR if this dry mass could act as a batch effect, means higher mass samples generally cluster together. If so you could include that information into the design as a covariate. For PCA you could use plotPCA after running the raw data through vst.