I am in the process of ironing out the details for data analysis of RNA-Seq data that comes from two different plant species that have each been given the same treatment, however I have an additional interaction in my experimental data that I would like to analyse as well, namely developmental stage.
I (will) have RNA-Seq data for two species A and B, control samples that were untreated (NT) and samples that were treated (T). The treatment is exactly the same for both species. Furthermore, samples were also taken at two different developmental stages for each the two species, seedling and juvenile.
After reading through the edgeR manual I think I have some idea of how my targets should be constructed.
Each sample type has 3 biological replicates, for a total of 24 samples (libraries) in total.
Sample Developmental Stage Treatment Organism Group 1 seedling NT A i 2 seedling NT A i 3 seedling NT A i 4 seedling NT B ii 5 seedling NT B ii 6 seedling NT B ii 7 seedling T A iii 8 seedling T A iii 9 seedling T A iii 10 seedling T B iv 11 seedling T B iv 12 seedling T B iv 13 juvenile NT A v 14 juvenile NT A v 15 juvenile NT A v 16 juvenile NT B vi 17 juvenile NT B vi 18 juvenile NT B vi 19 juvenile T A vii 20 juvenile T A vii 21 juvenile T A vii 22 juvenile T B viii 23 juvenile T B viii 24 juvenile T B viii
In order to compare orthologous genes between the two species, I used inparanoid to construct ortholog groupings of genes, using the published reference genomes for the two species. What I am thinking is that I would therefore perform alignment of my reads to the respective reference genome using a bowtie->tophat->htseq pipeline to derive the counts for a given gene and then using the inparanoid orthology mapping, assign the gene a common label so that orthologs gene expressions are compared. For example organism A has gene FOO and organism B has gene BAR, and they are orthologs of each other. So I would label each count to the FOO_BAR group for comparison of expression.
This would mean the count table would look something like:
Gene Group i Group ii Group iii Group iv Group v Group vi Group vii Group vii FOO_BAR 50 30 100 35 55 25 110 50 BAZ_BITER ... ... ... ... ... ... ... ... CAR_CDR ... ... ... ... ... ... ... ... CAT_TAC ... ... ... ... ... ... ... ...
I have only filled out a single row to show an example of an expected gene interaction in a pathway affected by treatment. Species A shows 2-fold change in expression without regard to developmental stage, whereas Species B only shows a 2-fold change once in the juvenile phase of its development. (I tried to keep the numbers simple, the actual data will no doubt not be as easily interpreted.)
According to known biological models regarding the effects of treatment, there should be a difference between the two organisms responses to treatment with respect to developmental age. However, I would like to thoroughly analyse the data for interactions between the experimental variables. I am having trouble visualizing/keeping track of the proper set-up of the interactions in order to conduct this series of analyses. And I'm relatively new to these types of analyses, so please feel free to let me know where I am making errors in my thought process.
I think that based upon the data and the possible interactions, I need to conduct several sets of interaction analyses. For example, I would like to identify the following:
- What genes are differentially expressed in the untreated state between species, irrespective of developmental stage?
- Something along the lines of ~species-stage ?
- How do I omit the treated state? Do I simply exclude those rows from the data and target matrix before constructing the interaction?
- What genes are differentially expressed between species with respect to treatment and developmental stage?
- What genes are differentially expressed within a species in response to treatment for a given developmental stage? (I believe I can answer this for each species separately, by restricting the data to the two conditions for a single species at a given stage of development and then perform a "vanilla" analysis. However, I am curious if it is possible to construct the interaction if I have all of the loaded data already?)
I was thinking this type of analysis is what would be classed as a factorial ANOVA? I'm just struggling to understand the proper way to set up the interaction to explore the data. I'm referencing Page 395 Michael J. Crawley's "The R book" to understand how to model formulae in R, but I'm not sure if the formula is properly representing the biological question I am asking. Most examples I find usually consist of 2 factors and my grasp of the material isn't strong enough yet to adequately see how this scales with 3 factors.
Accessing the literature on how conduct such an assay, I've come up short with any similar studies upon which I might be able to model the analyses. What I have found for cross-species RNA-Seq led me to look at inparanoid for constructing the ortholog mappings, but I'm stuck on how I accurately account for various interactions between the different factors.
Thank you for your time in reading this and for your assistance. If you have any further recommended reading I can access to help me wrap my head around multi-factor RNA-Seq analysis, it is greatly appreciated.