How to sum the gene-wise counts in edgeR?
3.7 years ago
sunnykevin97 ▴ 980

Hi,

How do I sum the gene-wise counts in edgeR to find differential genes between two groups? I'm looking for a two-group comparison with n=2 in each group. I'm new to R and literally stuck. Any suggestions, please?

The tabulated matrix looks something like this:
    temp  Test  Sample  GeneI  Gene2  Gene3  ....  Genen
    t80 T1  1   55
    t80 T1  2   89
    t80 T1  3   54
    t80 T1  4   453
    t80 T1  5   50
    t80 T1  6   32
    t80 T2  7   45
    t80 T2  8   45
    t80 T2  9   50
    t80 T2  10  54
    t80 T2  11  45
    t80 T2  12  15
    t8  T3  13  43
    t8  T3  14  25
    t8  T3  15  404
    t8  T3  16  385
    t8  T3  17  51
    t8  T3  18  32
    t8  T4  19  454
    t8  T4  20  395
    t8  T4  21  53
    t8  T4  22  35
    t8  T4  23  96
    t8  T4  24  87

     Code
    library(edgeR)
    matrix.dge <- DGEList(matrix)
    head(matrix.dge)

    sumgenes <- sumTechReps(matrix.dge,ID=colnames(matrix.dge))
    head(sumgenes)
    sumcounts <- sumgenes$counts
    nrow(sumcounts)
    head(sumcounts)
    tail(sumcounts)

The result seems to be the same after applying the sumTechReps function.
RNA-Seq r R • 1.7k views
3.7 years ago

For RNA-seq you generally want biological replicates as opposed to technical replicates. That being said, there are a few things that you need to change.

Example data.

df <- structure(list(temp = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("t8", "t80"), class = "factor"), Test = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("T1", "T2", "T3", "T4"
), class = "factor"), Sample = 1:24, GeneI = c(55L, 89L, 54L,
453L, 50L, 32L, 45L, 45L, 50L, 54L, 45L, 15L, 43L, 25L, 404L,
385L, 51L, 32L, 454L, 395L, 53L, 35L, 96L, 87L), GeneII = c(204,
200, 44, 212, 299, 405, 79, 291, 303, 259, 207, 255, 99, 226,
206, 316, 80, 126, 131, 186, 478, 451, 323, 185)), row.names = c(NA,
-24L), class = "data.frame")

First, the matrix should have genes as rows and samples as columns.

library("tidyverse")

mat <- df %>%
  select(-temp, -Test) %>%
  mutate(Sample = str_c("sample_", Sample)) %>%
  column_to_rownames("Sample") %>%
  t %>%
  as.matrix

> mat[, 1:5]
       sample_1 sample_2 sample_3 sample_4 sample_5
GeneI        55       89       54      453       50
GeneII      204      200       44      212      299

You then want a group data.frame.

groups <- df %>%
  select(temp, Test, Sample) %>%
  mutate(Sample = str_c("sample_", Sample)) %>%
  column_to_rownames("Sample")

> head(groups, 5)
         temp Test
sample_1  t80   T1
sample_2  t80   T1
sample_3  t80   T1
sample_4  t80   T1
sample_5  t80   T1
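
For readers without the tidyverse installed, the two reshaping steps above can be sketched in base R. The snippet uses a four-sample toy subset of the example data so it stays short; the object names mirror the ones above, and `gene_cols` is only an illustrative helper.

```r
# Toy subset of the example data: one row per sample.
df <- data.frame(
  temp   = c("t80", "t80", "t8", "t8"),
  Test   = c("T1", "T1", "T3", "T3"),
  Sample = 1:4,
  GeneI  = c(55, 89, 43, 25),
  GeneII = c(204, 200, 99, 226)
)

# Everything that is not sample metadata is treated as a gene column.
gene_cols <- setdiff(names(df), c("temp", "Test", "Sample"))

# Genes as rows, samples as columns.
mat <- t(as.matrix(df[, gene_cols]))
colnames(mat) <- paste0("sample_", df$Sample)

# One metadata row per sample, named like the matrix columns.
groups <- df[, c("temp", "Test")]
rownames(groups) <- colnames(mat)

# The columns of mat and the rows of groups should describe the same samples.
stopifnot(identical(colnames(mat), rownames(groups)))
```

The final check is a cheap way to catch a matrix and a sample table that have drifted out of order before they are combined.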

If you want to sum the technical replicates, you can do so after creating the DGEList object, using the above matrix and data.frame as inputs. I'm assuming Test denotes the technical replicates?

library("edgeR")

dge <- DGEList(mat, samples=groups)
dge <- sumTechReps(dge, dge$samples$Test)  # sum columns that share a Test label

> dge
An object of class "DGEList"
$counts
         T1   T2   T3   T4
GeneI   733  254  940 1120
GeneII 1364 1394 1053 1754

$samples
   group lib.size norm.factors temp Test
T1     1     2097            1  t80   T1
T2     1     1648            1  t80   T2
T3     1     1993            1   t8   T3
T4     1     2874            1   t8   T4
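
This also hints at why the call in the question left the counts unchanged: `ID=colnames(matrix.dge)` gives every column its own ID (assuming the column names are unique), so no two columns are ever pooled. The sketch below illustrates the idea with base R's `rowsum()` on the example counts. It is a stand-in for the column pooling, not edgeR's own code, and the names `test`, `pooled` and `unpooled` are only for illustration.

```r
# Example counts from above, genes as rows and samples as columns.
gene_i  <- c(55, 89, 54, 453, 50, 32, 45, 45, 50, 54, 45, 15,
             43, 25, 404, 385, 51, 32, 454, 395, 53, 35, 96, 87)
gene_ii <- c(204, 200, 44, 212, 299, 405, 79, 291, 303, 259, 207, 255,
             99, 226, 206, 316, 80, 126, 131, 186, 478, 451, 323, 185)
mat <- rbind(GeneI = gene_i, GeneII = gene_ii)
colnames(mat) <- paste0("sample_", 1:24)

# Replicate label for each column (six samples per Test level).
test <- rep(c("T1", "T2", "T3", "T4"), each = 6)

# rowsum() pools rows by group, so transpose, pool, and transpose back.
pooled <- t(rowsum(t(mat), group = test))
print(pooled)

# With one unique ID per column, every sample forms its own group,
# so the "pooled" matrix still has 24 columns holding the original counts.
unpooled <- t(rowsum(t(mat), group = colnames(mat)))
print(ncol(unpooled))
```

Passing a vector with repeated labels, such as the Test column, is what makes the summing actually happen.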

Alternatively, instead of summing the technical replicates, you can account for them by adding the replicate factor as a term in the regression model when building the design matrix.

design <- model.matrix(~ temp + Test, groups)
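
One thing worth checking before fitting this model: in the example data each Test level falls entirely within one temp level (T1 and T2 are t80, T3 and T4 are t8), so temp and Test carry overlapping information. A quick base R check, rebuilding the groups table by hand so the snippet is self-contained, compares the rank of the design matrix with its number of columns. The `design_test` object is a hypothetical alternative added only for comparison.

```r
# Sample table with the same layout as the example data.
groups <- data.frame(
  temp = factor(rep(c("t80", "t8"), each = 12)),
  Test = factor(rep(c("T1", "T2", "T3", "T4"), each = 6)),
  row.names = paste0("sample_", 1:24)
)

# Design with both terms, as in the line above.
design <- model.matrix(~ temp + Test, groups)
print(c(columns = ncol(design), rank = qr(design)$rank))

# Design with the replicate factor only, for comparison.
design_test <- model.matrix(~ Test, groups)
print(c(columns = ncol(design_test), rank = qr(design_test)$rank))
```

If the rank is lower than the number of columns, at least one coefficient cannot be estimated. Dropping one of the two nested terms is one way around that; which term to keep depends on the comparison you are after.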

Rpolicastro - I love seeing your tidyverse-esque responses on here. I always end up learning something new from ya. Quick question though. Whenever you create a sample data.frame, you always use this framework:

structure(list(...... row.names = c(NA,
-24L), class = "data.frame")

Any reason why you use this rather than the tibble() function or even data.frame()?


Thanks, glad to hear my code snippets are helping people!

The format you see is the result of running the dput function on a data.frame. It's a way to share a longer data.frame (or most other types of R objects) in a slightly more condensed format. I usually wouldn't use dput in my actual code, just as a way to share example data on sites like Biostars or Stack Overflow.
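
As a small illustration of that round trip, here is a sketch with a made-up three-row data.frame: `dput()` prints R code for the object, and evaluating that code rebuilds it. The names `code_lines` and `rebuilt` are only for this example.

```r
# A tiny data.frame to share.
df <- data.frame(Sample = 1:3, GeneI = c(55L, 89L, 54L))

# dput() prints R code that recreates the object; this is what gets pasted.
dput(df)

# Capture that printed code as text, then evaluate it to rebuild the object.
code_lines <- capture.output(dput(df))
rebuilt <- eval(parse(text = code_lines))

# The rebuilt object should match the original.
print(all.equal(df, rebuilt))
```

Anyone reading the post can paste the printed `structure(...)` call straight into their console and work with the same data.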


Thanks for the informative code.
