Question: How to sum the genewise counts in edgeR ?
0
7 weeks ago by
sunnykevin97140
sunnykevin97140 wrote:

HI

How to sum the gene-wise counts in edge R for differential genes comparing 2 groups. Looking for two-group comparison with n=2 in each group. I'm new to R literally struck. suggestion please.

``````The tabulate matrix looks some thing like this
temp  Test Sample GeneI Gene2 Gene2 Gene3 ....Genen
t80 T1  1   55
t80 T1  2   89
t80 T1  3   54
t80 T1  4   453
t80 T1  5   50
t80 T1  6   32
t80 T2  7   45
t80 T2  8   45
t80 T2  9   50
t80 T2  10  54
t80 T2  11  45
t80 T2  12  15
t8  T3  13  43
t8  T3  14  25
t8  T3  15  404
t8  T3  16  385
t8  T3  17  51
t8  T3  18  32
t8  T4  19  454
t8  T4  20  395
t8  T4  21  53
t8  T4  22  35
t8  T4  23  96
t8  T4  24  87

Code
library(edgeR)
matrix.dge <- DGEList(matrix)

sumgenes <- sumTechReps(matrix.dge,ID=colnames(matrix.dge))
sumcounts <- sumgenes\$counts
nrow(sumcounts)
tail(sumcounts)

The result seems to be same after applying sumTechReps function
``````
rna-seq R • 219 views
modified 7 weeks ago by rpolicastro1.7k • written 7 weeks ago by sunnykevin97140
3
7 weeks ago by
rpolicastro1.7k
rpolicastro1.7k wrote:

For RNA-seq you generally want biological replicates as opposed to technical replicates. That being said, there are a few things that you need to change.

Example data.

``````df <- structure(list(temp = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("t8", "t80"), class = "factor"), Test = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("T1", "T2", "T3", "T4"
), class = "factor"), Sample = 1:24, GeneI = c(55L, 89L, 54L,
453L, 50L, 32L, 45L, 45L, 50L, 54L, 45L, 15L, 43L, 25L, 404L,
385L, 51L, 32L, 454L, 395L, 53L, 35L, 96L, 87L), GeneII = c(204,
200, 44, 212, 299, 405, 79, 291, 303, 259, 207, 255, 99, 226,
206, 316, 80, 126, 131, 186, 478, 451, 323, 185)), row.names = c(NA,
-24L), class = "data.frame")
``````

First, the matrix should be genes as rows, and samples as columns.

``````library("tidyverse")

mat <- df %>%
select(-temp, -Test) %>%
mutate(Sample = str_c("sample_", Sample)) %>%
column_to_rownames("Sample") %>%
t %>%
as.matrix

> mat[, 1:5]
sample_1 sample_2 sample_3 sample_4 sample_5
GeneI        55       89       54      453       50
GeneII      204      200       44      212      299
``````

You then want a group data.frame.

``````groups <- df %>%
select(temp, Test, Sample) %>%
mutate(Sample = str_c("sample_", Sample)) %>%
column_to_rownames("Sample")

temp Test
sample_1  t80   T1
sample_2  t80   T1
sample_3  t80   T1
sample_4  t80   T1
sample_5  t80   T1
``````

If you want to sum the technical replicates, You can then do so after creating the DGEList object using the above matrix and data.frame as inputs. I'm assuming Test denotes the technical replicates?

``````dge <- DGEList(mat, samples=groups)
dge <- sumTechReps(dge, dge\$samples\$Test)

> dge
An object of class "DGEList"
\$counts
T1   T2   T3   T4
GeneI   733  254  940 1120
GeneII 1364 1394 1053 1754

\$samples
group lib.size norm.factors temp Test
T1     1     2097            1  t80   T1
T2     1     1648            1  t80   T2
T3     1     1993            1   t8   T3
T4     1     2874            1   t8   T4
``````

Alternatively, instead of summing the technical replicates, you can add it to the regression model to account for it when building the design matrix.

``````design <- model.matrix(~ temp + Test, groups)
``````

Rpolicastro - I love seeing your tidyverse-esque responses on here. I always end up learning something new from ya. Quick question though. Whenever you create a sample data.frame, you always use this framework:

``````structure(list(...... row.names = c(NA,
-24L), class = "data.frame")
``````

Any reason on why you use this rather than using the tibble() function or even data.frame()?

2

Thanks, glad to hear my code snippets are helping people!

The format you see is the result of the `dput` function being run on a data.frame. It's a way to share a longer data.frame (or most other types of R objects) in a slightly more condensed format. I usually wouldn't use dput in my actual code, just as a way to share example data on some place like biostars or stack overflow.