Differential expressed gene analysis ---- to normalize the count data and fit into a model
1
0
Entering edit mode
7.8 years ago
juncheng ▴ 200

Hi,

I have two corresponded count table, they are RNA-seq count data.

Example, actual matrix much larger:

Matrix A:

    3    3    6    2     0     0
15   10    0    8     0     0
0    0    0    0     0     0
0    0    0    0     0     0
3   14    2    2     2     7


Matrix B:

   29.0        31.5          27        29.5        0.0           0
33.0        37.5          0         34.0        0.0           0
0.0         0.0           0         0.0         0.0           0
0.0         0.0           0         0.0         0.0           0
24.5        22.5          26        23.5        23.5          24


The second matrix is a average of two integer count table, so it has *.5 numbers.

The two matrix are corresponding with each other, that means what in the first matrix is 0, it is 0 at the second.

The problem is, I want normalize the first matrix with the second one, i.e, by element-wise division of first matrix with the second.

1. the data are extremely skewed with a lot of '0's.
2. the number in second matrix is general much larger than the first. After division, I end up with a matrix containing very tinny numbers and '0's.

I want to find differentially expression genes across conditions. (each 3 columns are replicates). I'm now stacked with how to properly normalize and what kind of model to fit into a statistic test.

Any help would be appreciated.

Best

R RNA-Seq • 2.5k views
1
Entering edit mode

You can perform gene-wise normalization in limma, edgeR and DESeq2 (this is how conditional quantile normalization works), but the bigger questions are:

1. What does matrix B actually represent? That is, where did the original count tables come from?
2. What's the actual goal of the gene-wise normalization?

Answering #2 will also have answer the appropriate type of model to use.

0
Entering edit mode

It actually hard to explain, my goal is quite specific. Short answer is the first count data is dependent on the second, as I'm interested in the part of variance from matrix A, which are independent of the matrix B, I would like to normalize the matrix A against B.

0
Entering edit mode

OK, well then doing something like the following in DESeq2 may be the easiest:

normalizationFactors(dds) <- matrix_B/mean(matrix_B)

Note that dds is a DESeqDataSet object. You can do something similar in edgeR and limma.

0
Entering edit mode

I should mention that using matrix B directly as I showed may be the opposite of what you want. This way will essentially bias estimates toward values with high matrix B values. If you instead want the exact opposite of this then you'll need to modify matrix B accordingly before using it (still divide it by its mean).

1
Entering edit mode
7.8 years ago
Craig Henry ▴ 20

Your matrix A and B - are they environmental replicates? Integrate these two matrices separately into your analysis if so - try filtering instead. Below is a formula I use in edgeR to filter low-expressed genes across RNA-Seq count tables - then use the exact test to compare these groups. First, try using some code I from a very good Nature Protocols paper:

Anders, S. et al. (2013) Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 8, 1765–86

library(edgeR)

group <- factor(c(rep("your_group_1", 3) , rep("your_group_2", 3)))

d <- DGEList(counts = counts, group = group, genes = genes)
k <- rowSums(cpm(d)>1)>=3
d <- d[k,]

d<-estimateCommonDisp(d,verbose=TRUE)
d<-calcNormFactors(d,method="TMM")
d<-estimateTagwiseDisp(d,verbose=TRUE)
d.est<-estimateSmoothing(d)

et=exactTest(d,pair=c(1:2))

tt=topTags(et,n=nrow(d))
rn=rownames(tt$table) deg=rn[tt$table\$FDR<0.05]

plotMeanVar(d,show.tagwise.vars=TRUE,NBline=TRUE)
plotBCV(d)
plotSmear(d,de.tags=deg)


Cheers!