Hi everyone,

I have been stuck a few days on this problem, so I thought I'd better ask. This is not just any edgeR question as the input data is different than usual (non-RNA). So if anyone knows of a better tool to run this analysis please tell me, otherwise just assume the data is RNA counts data.

Data: unnormalized counts (derived from depth i.e. mapped reads against a metagenomic bin). A metagenomic bin corresponds to an organism (bacterium species) so obviously its abundance is correlated with the abundance of other species within the same sample, but it may not be relevant to the question here.

Data structure: (every subject belongs to either C or T group and is sampled twice at t=0 and t=1)

```
df <- data.frame(timepoint = rep_len(0:1, length.out=16),
subjectID = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8),
group = c(rep("C", 8), rep("T", 8)))
```

Aim: find differentially abundant metagenomic bins between the two cohorts (C vs T) i.e. which genomes go up/down between t=0 and t=1 in a different way between the two cohorts?

My code so far:

```
#clean dataframe by selecting rows with only a certain amount of NA values
input_df <- df[rowSums( is.na(df ) <=80, ]
#replace missing values with zeros
input_df[is.na(input_df)] <- 0
```

Prepare the count data (convert countdata first column (genomes) to rownames):

```
count_data <- data.frame(input_df[,-1], row.names=input_df[,1], check.names = FALSE)
```

Run voom:

```
count_data_2 <- voom(count_data)
```

Prepare design matrix: (I am creating 4 groups: C_t0, C_t1, T_t0, T_t1)

```
Treat <- factor(paste(metadata$cohort,metadata$date,sep="."))
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
```

Block on subject: (gives me warnings)

```
corfit <- duplicateCorrelation(count_data_2,design,block=metadata$subject)
corfit$consensus
```

Fit and make contrasts:

```
fit <- lmFit(count_data_2,design,block=metadata$subject,correlation=corfit$consensus)
cm <- makeContrasts(
T1vT0 = Treatment.1-Treatment.0,
C1vC0 = Control.2-Control.1,
T0vC0 = Treatment.1-Control.1,
T1vC1 = (Treatment.1-Treatment.0)-(Control.2-Control.1),
levels = design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
top20 <- topTable(fit2, n = 20, coef="T1vC1", adjust.method = "BH")
```

Q: Is this setup/my code correct? Can it be improved? Do I get in this way what I am after?

I also tried edgeR (instead of limma/voom) and I was getting seemingly more real p-values (as in: closer to p=0.005, whereas now with limma/voom I am getting suspiciously tiny p-values).

Thank you so much for your help!

Kind Regards Dany

Thank you Fabio. I remember reading a paper that compared metagenomeseq to edgeR among others and metagenomeseq would not perform really well: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2386-y

Did you use metegenomeseq when you had two groups to compare, each with => 2 timepoints?

Sorry, not yet (only single time-point). I will do it shortly, and post developments if I find something that might be of help.