I am using Voom to test a few hypotheses in my data. NB, my data is metagenomic reads of gut microbiome communities (rather than RNA where Voom is more commonly used for). We have a very large amount of data ( 126 pigs; 6 cohorts; each pig sampled about 6-8 times throughout a 6 weeks period, producing ca 830 gut microbiome samples).
Reads went through these steps: per subject (multiple time points) co-assembly (Megahit) —> binning (Metabat2) —> dereplication (dRep) —> unnormalised data into Voom. Going into Voom it' s the unnormalised depths (reads mapped against metagenomic bins) for each pig and time point, so instead of genes as rows and samples as columns, in my case we have metagenomic bins as rows and samples as columns.
I successfully ran Voom, for example, by testing the hypothesis: “The antibiotic treatment (in antibiotic group) causes some metagenomic bins to be increased/decreased between t=0 and t=2 compared to the control group at t=0 and t=2” . I ran this with the blocking-for-subject option because each pig belongs to one of the 6 groups and is sampled multiple times.
Q1: Do you think the way I am using Voom here is correct? Here a snippet of what is in the attached file:
Step 1. Filtering Subset to df containing rows with X number of NAs in cols. 80 is based on NCOL(ccc_sorted2) minus 10 . Plus, replace missing values with zeros
tokeep <- ccc_sorted2[rowSumsis.na(ccc_sorted2)) <= 80, ] tokeep[is.na(tokeep)] <- 0
Step 2. Prepare metadata
Step 3. Prepare the design matrix
Treat <- factor(paste(metadata$cohort,metadata$date,sep=".")) design <- model.matrix(~0+Treat) colnames(design) <- levels(Treat)
Step 4. Filtering
count_data <- count_data[ rowSums(cpm(count_data)>100) >= 5, ]
Step 5. VOOM:
count_data <- voom(count_data) count_data_2=as.matrix(count_data) corfit <- duplicateCorrelation(count_data_2,design,block=metadata$pig) corfit$consensus fit <- lmFit(count_data_2,design,block=metadata$pig,correlation=corfit$consensus) cm <- makeContrasts( N1vN0 = Neomycin.t2-Neomycin.t0, C1vC0 = Control.t2-Control.t0 N0vC0 = Neomycin.t0-Control.t0, N1vC1 = (Neomycin.t2-Neomycin.t0)-(Control.t2-Control.t0), levels = design) fit2 <- contrasts.fit(fit, cm) fit2 <- eBayes(fit2)
Step 6. Test hypotheses: hypothesis 1: neomycin has an effect (neomycin group vs Control before and after treatment)
H1 <- topTable(fit2, n = 50, coef="N1vC1", adjust.method = "BH", sort.by = "logFC", resort.by = "P") H1 <- subset(H1, adj.P.Val < 0.05)
Q2: I noticed that when I filter out rows containing an X number of NAs in step 1 (filtering) , the number of top genes I find changes considerably. What is the right thing to do in these cases? Should I keep both filtering steps ? The reason I have step 1. filtering is that I noticed I get DE genes that are DE at t=0 between the cohorts, although my contrast _should_ explicitly exclude those DE at t=0. Adding filtering step 1 solves seem to solve this problem, but I am not sure why.
Q3: If I want to run more contrasts, I should be able to just make the design matrix accordingly and add them to the contrasts, then get the top genes. However, for some (to me) obscure reason, when I add cohorts NeoD and NeoC to the data (before step 1), my top genes list that I obtain when I compare Neo and Control, changes and doesn’t make too much sense anymore. For instance, I get among the top genes, “5_1” which is differentially expressed at t=0 between the two groups and it shouldn’t come up in the contrast “N1vC1 = (Neomycin.t2-Neomycin.t0)-(Control.t2-Control.t0)”. This gene appears in the top gene list only when I add more cohorts datapoints to the dataset. What do you think I am doing wrong? What is the correct way to add more contrasts and filter accordingly? Should I instead each time make different subsets that go into step 1?
I tried to be as concise as possible. If I am omitting something important for you to take into account to elaborate a feedback, please don’t hesitate to ask me.
Thank you!! Dany