Hi everybody, I'm new in this field. I'm trying to replicate a paper to train my self . The results come out pretty the same but not exactly the same, so I wanted to know if all my steps are right or if I'm missing something ( or even completely doing it wrong ). I did follow edgeR guide for help.
Those are the steps ( I jump not important code lines ) :
1 - Prior to computation of differentially expressed transcripts (‘ANOVA testing for differences’), genes with less than five (non-normalized) read counts in all samples were excluded from analyses.
counts <- counts[!apply(counts<5,1,all,na.rm=TRUE),]
2 - Next, data normalization factors were calculated that are incorporated in further analyses by the weighted trimmed mean of M-values (”TMM”) method.
list <- DGEList(counts=counts,group=group) list <- calcNormFactors(list,method = "TMM")
3 - After fitting of negative binominal models and both common, tagwise and trended dispersion estimates were obtained, differentially expressed transcripts were determined using a generalized linear model (GLM) likelihood ratio test.
design <- model.matrix(~ 0 + group) list <- estimateDisp(list,design,robust = TRUE) fit <- glmQLFit(list,design,robust = TRUE) single.contrast <- makeContrasts(Tumor - HC, levels = design) test <- glmQLFTest(fit,contrast = single.contrast)
4 - To minimize the effect of batch effect, resulting in possibly false positive results, ‘sequencing batch’ was included as a covariate in the GLM likelihood ratio test design matrix. DONT KNOW HOW TO DO
5 - To determine differentially expressed transcripts, we only focused on transcripts with expression levels of logarithmic counts per million (LogCPM) > 3.
tags <- topTags(test, n=nrow(test$table))$table tags.logcpm <- tagsA[tags$logCPM>3,]
At this point they obtain 5003 genes , I do obtain 4986 instead. But the logCPM values come out the same , just differences between FDR values (not so much tho).
Important Question : They do mention another thing that is the following -> To determine the specific input gene lists for the classifying algorithms we performed ANOVA testing for differences (as implemented in the R-package edgeR), yielding classifier-specific gene lists. Here they instead retrive 1072 genes that are actually inside my 4986 but I dont really understand what is this ANOVA testing for differences , and on what they did this (step 1? or after step 5?). This ANOVA testing is made to obtain genes for SVM classification so I assume they wanted genes with major differences.