Question: Differential expressed gene analysis ---- to normalize the count data and fit into a model
0
gravatar for juncheng
6.5 years ago by
juncheng200
köln
juncheng200 wrote:

Hi, 

I have two corresponded count table, they are RNA-seq count data.

example, actual matrix much larger: 

Matrix A:
    3    3    6    2     0     0
   15   10    0    8     0     0
    0    0    0    0     0     0
    0    0    0    0     0     0
    3   14    2    2     2     7
Matrix B:
   29.0        31.5          27        29.5        0.0           0
   33.0        37.5          0         34.0        0.0           0
   0.0         0.0           0         0.0         0.0           0
   0.0         0.0           0         0.0         0.0           0
   24.5        22.5          26        23.5        23.5          24

The second matrix is a average of two integer count table, so it has *.5 numbers.

The two matrix are corresponding with each other, that means what in the first matrix is 0, it is 0 at the second.

The problem is, I want normalize the first matrix with the second one,  i.e, by element-wise division of first matrix with the second.

(1): the data are extremely skewed with a lot of '0's.

(2): the number in second matrix is general much larger than the first. After division, I end up with a matrix containing very tinny numbers and '0's. 

I want to find differentially expression genes across conditions. (each 3 columns are replicates). I'm now stacked with how to properly normalize and what kind of model to fit into a statistic test. 

Any help would be appreciated. 

Best

 

rna-seq R • 2.3k views
ADD COMMENTlink modified 6.5 years ago by Craig Henry20 • written 6.5 years ago by juncheng200
1

You can perform gene-wise normalization in limma, edgeR and DESeq2 (this is how conditional quantile normalization works), but the bigger questions are:

  1. What does matrix B actually represent? That is, where did the original count tables come from?
  2. What's the actual goal of the gene-wise normalization?

Answering #2 will also have answer the appropriate type of model to use.

ADD REPLYlink written 6.5 years ago by Devon Ryan98k

It actually hard to explain, my goal is quite specific. Short answer is the first count data is dependent on the second, as I'm interested in the part of variance from matrix A, which are independent of the matrix B, I would like to normalize the matrix A against B.

ADD REPLYlink written 6.5 years ago by juncheng200

OK, well then doing something like the following in DESeq2 may be the easiest:

normalizationFactors(dds) <- matrix_B/mean(matrix_B)

Note that dds is a DESeqDataSet object. You can do something similar in edgeR and limma.

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Devon Ryan98k

I should mention that using matrix B directly as I showed may be the opposite of what you want. This way will essentially bias estimates toward values with high matrix B values. If you instead want the exact opposite of this then you'll need to modify matrix B accordingly before using it (still divide it by its mean).

ADD REPLYlink written 6.5 years ago by Devon Ryan98k
1
gravatar for Craig Henry
6.5 years ago by
Craig Henry20
United States
Craig Henry20 wrote:

Your matrix A and B - are they environmental replicates? Integrate these two matrices separately into your analysis if so - try filtering instead. Below is a formula I use in edgeR to filter low-expressed genes across RNA-Seq count tables - then use the exact test to compare these groups. First, try using some code I from a very good Nature Protocols paper:

Anders, S. et al. (2013) Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 8, 1765–86

  library(edgeR)

  group <- factor(c(rep("your_group_1", 3) , rep("your_group_2", 3)))

  d <- DGEList(counts = counts, group = group, genes = genes)
  k <- rowSums(cpm(d)>1)>=3
  d <- d[k,]

  d<-estimateCommonDisp(d,verbose=TRUE)
  d<-calcNormFactors(d,method="TMM")
  d<-estimateTagwiseDisp(d,verbose=TRUE)
  d.est<-estimateSmoothing(d)

  et=exactTest(d,pair=c(1:2))

  tt=topTags(et,n=nrow(d))
  rn=rownames(tt$table)
  deg=rn[tt$table$FDR<0.05]

  plotMeanVar(d,show.tagwise.vars=TRUE,NBline=TRUE)
  plotBCV(d)
  plotSmear(d,de.tags=deg)

 

Cheers!

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Craig Henry20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1797 users visited in the last hour
_