Question

Cross-species RNA-Seq analysis using edgeR with multiple factors?

0

Entering edit mode

8.4 years ago

dan.shea ▴ 10

Hello,

I am in the process of ironing out the details for data analysis of RNA-Seq data that comes from two different plant species that have each been given the same treatment, however I have an additional interaction in my experimental data that I would like to analyse as well, namely developmental stage.

I (will) have RNA-Seq data for two species A and B, control samples that were untreated (NT) and samples that were treated (T). The treatment is exactly the same for both species. Furthermore, samples were also taken at two different developmental stages for each the two species, seedling and juvenile.

After reading through the edgeR manual I think I have some idea of how my targets should be constructed.

Each sample type has 3 biological replicates, for a total of 24 samples (libraries) in total.

Sample  Developmental Stage  Treatment  Organism  Group
1       seedling             NT         A         i
2       seedling             NT         A         i
3       seedling             NT         A         i
4       seedling             NT         B         ii
5       seedling             NT         B         ii
6       seedling             NT         B         ii
7       seedling             T          A         iii
8       seedling             T          A         iii
9       seedling             T          A         iii
10      seedling             T          B         iv
11      seedling             T          B         iv
12      seedling             T          B         iv
13      juvenile             NT         A         v
14      juvenile             NT         A         v
15      juvenile             NT         A         v
16      juvenile             NT         B         vi
17      juvenile             NT         B         vi
18      juvenile             NT         B         vi
19      juvenile             T          A         vii
20      juvenile             T          A         vii
21      juvenile             T          A         vii
22      juvenile             T          B         viii
23      juvenile             T          B         viii
24      juvenile             T          B         viii

In order to compare orthologous genes between the two species, I used inparanoid to construct ortholog groupings of genes, using the published reference genomes for the two species. What I am thinking is that I would therefore perform alignment of my reads to the respective reference genome using a bowtie->tophat->htseq pipeline to derive the counts for a given gene and then using the inparanoid orthology mapping, assign the gene a common label so that orthologs gene expressions are compared. For example organism A has gene FOO and organism B has gene BAR, and they are orthologs of each other. So I would label each count to the FOO_BAR group for comparison of expression.

This would mean the count table would look something like:

Gene        Group i   Group ii   Group iii   Group iv   Group v   Group vi   Group vii   Group vii
FOO_BAR     50        30         100         35         55        25         110         50
BAZ_BITER   ...       ...        ...         ...        ...       ...        ...         ...
CAR_CDR     ...       ...        ...         ...        ...       ...        ...         ...
CAT_TAC     ...       ...        ...         ...        ...       ...        ...         ...

I have only filled out a single row to show an example of an expected gene interaction in a pathway affected by treatment. Species A shows 2-fold change in expression without regard to developmental stage, whereas Species B only shows a 2-fold change once in the juvenile phase of its development. (I tried to keep the numbers simple, the actual data will no doubt not be as easily interpreted.)

According to known biological models regarding the effects of treatment, there should be a difference between the two organisms responses to treatment with respect to developmental age. However, I would like to thoroughly analyse the data for interactions between the experimental variables. I am having trouble visualizing/keeping track of the proper set-up of the interactions in order to conduct this series of analyses. And I'm relatively new to these types of analyses, so please feel free to let me know where I am making errors in my thought process.

I think that based upon the data and the possible interactions, I need to conduct several sets of interaction analyses. For example, I would like to identify the following:

What genes are differentially expressed in the untreated state between species, irrespective of developmental stage?
- Something along the lines of ~species-stage ?
- How do I omit the treated state? Do I simply exclude those rows from the data and target matrix before constructing the interaction?
What genes are differentially expressed between species with respect to treatment and developmental stage?
- ~speciesstagetreatment?
What genes are differentially expressed within a species in response to treatment for a given developmental stage? (I believe I can answer this for each species separately, by restricting the data to the two conditions for a single species at a given stage of development and then perform a "vanilla" analysis. However, I am curious if it is possible to construct the interaction if I have all of the loaded data already?)

I was thinking this type of analysis is what would be classed as a factorial ANOVA? I'm just struggling to understand the proper way to set up the interaction to explore the data. I'm referencing Page 395 Michael J. Crawley's "The R book" to understand how to model formulae in R, but I'm not sure if the formula is properly representing the biological question I am asking. Most examples I find usually consist of 2 factors and my grasp of the material isn't strong enough yet to adequately see how this scales with 3 factors.

Accessing the literature on how conduct such an assay, I've come up short with any similar studies upon which I might be able to model the analyses. What I have found for cross-species RNA-Seq led me to look at inparanoid for constructing the ortholog mappings, but I'm stuck on how I accurately account for various interactions between the different factors.

Thank you for your time in reading this and for your assistance. If you have any further recommended reading I can access to help me wrap my head around multi-factor RNA-Seq analysis, it is greatly appreciated.

RNA-Seq R edgeR • 2.9k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by dan.shea ▴ 10

0

Entering edit mode

edit: 2015-12-08

I've re-read the entire edgeR manual and the chapter in Crawley's The R Book on generalized linear models. I think I see that the best way to structure the design matrix is as I outlined above, but playing with makeContrasts I now see that to answer some of the biological questions that I have, is best served by setting up the GLM such that each of the 8 groups is a coefficient in the model matrix. e.g. model.matrix(~0+groups)

Then I can construct the contrasts with makeContrasts and apply them to the fit of the GLM.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by dan.shea ▴ 10