Hi all.
I've been searching this forum for an answer to my question, but I'm struggling to make sense of it as I do not have a statistical background.
My bulk RNA-seq data set consists of over 100 pig samples from 5 tissues, across 2 trials, with 2 conditions (high residual feed intake and low residual feed intake).
For example:
replicates, tissue, trial, rfi
5, duodenum, 1, high
5, duodenum, 1, low
6, duodenum, 2, high
4, duodenum, 2, low
5, ileum, 1, high
6, ileum, 1, low
5, ileum, 2, high
3, ileum, 2, low
etc.... for 3 more tissues
First: I am treating trial carefully here as this is actually 2 data sets combined from 2 different experimental protocols.
- I am assuming some kind of batch effect correction is required? If that is correct, can someone point me in the right direction for learning current best practice for identifying/removing batch effects in this case (i.e. a post/tutorial/blog/software etc.)?
- However, to complicate matters, I believe the lab would also like to know the difference caused by the change in experimental protocol. In that case, is removing the batch effect even appropriate?
Second, how can I calculate differential expression in such a complex design? I am familiar with DESeq2 and edgeR, but only for pairwise comparisons (tissue vs. tissue, or condition vs. condition), and this design is far more complex.
I have read some Biostars posts advising for/against ANOVA, linear models, and various DESeq2/edgeR settings, but I am getting lost in the detail and I am unsure whether the advice applies to this dataset.
Thanks in advance, Kenneth
Try a GLM in edgeR?
Thank you, I do appreciate the suggestion, but as I said: "various DESeq2/edgeR settings but I am getting lost in the detail and I am unsure of whether the advice applies to this dataset". So I need a more detailed suggestion applied to this situation.
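To make the GLM suggestion a little more concrete, here is a minimal sketch in plain Python (not edgeR or DESeq2 code) of the design matrix that an additive model of the form ~ trial + tissue + rfi would encode for the eight groups listed in the question. The column names and the choice of reference levels (duodenum, trial 1, high RFI) are assumptions made for illustration, and only the two tissues shown in the question are included. One common approach, offered here as a suggestion rather than a settled answer, is to keep trial in the model as a covariate instead of removing it beforehand, which also leaves a trial coefficient that can be inspected for the protocol difference.

```python
# Groups copied from the table in the question: (replicates, tissue, trial, rfi).
groups = [
    (5, "duodenum", 1, "high"),
    (5, "duodenum", 1, "low"),
    (6, "duodenum", 2, "high"),
    (4, "duodenum", 2, "low"),
    (5, "ileum", 1, "high"),
    (6, "ileum", 1, "low"),
    (5, "ileum", 2, "high"),
    (3, "ileum", 2, "low"),
]

# Expand to one entry per sample, since a design matrix has one row per sample.
samples = [
    (tissue, trial, rfi)
    for n, tissue, trial, rfi in groups
    for _ in range(n)
]

# Hypothetical column names for an additive design with dummy (treatment) coding.
columns = ["intercept", "trial2", "tissue_ileum", "rfi_low"]


def design_row(tissue, trial, rfi):
    """Return one design-matrix row for a sample."""
    return [
        1,                       # intercept: duodenum, trial 1, high RFI
        int(trial == 2),         # shift associated with trial 2 (protocol/batch)
        int(tissue == "ileum"),  # shift associated with ileum vs. duodenum
        int(rfi == "low"),       # low vs. high RFI, adjusted for trial and tissue
    ]


design = [design_row(*s) for s in samples]

# Print one representative row per group.
seen = set()
for s, row in zip(samples, design):
    if s not in seen:
        seen.add(s)
        print(s, dict(zip(columns, row)))
```

In this coding, testing the rfi_low column asks whether expression differs between low and high RFI after accounting for trial and tissue, and testing trial2 asks about the difference between the two protocols. If the RFI effect is expected to differ by tissue, an interaction term or a single combined group factor would be needed instead; the additive form above is only the simplest starting point.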