Gene Expression Analysis Steps ?
1
0
Entering edit mode
5 months ago
ali ▴ 20

Hi everybody, I'm new in this field. I'm trying to replicate a paper to train my self . The results come out pretty the same but not exactly the same, so I wanted to know if all my steps are right or if I'm missing something ( or even completely doing it wrong ). I did follow edgeR guide for help.

Those are the steps ( I jump not important code lines ) :

1 - Prior to computation of differentially expressed transcripts (‘ANOVA testing for differences’), genes with less than five (non-normalized) read counts in all samples were excluded from analyses.

counts <- counts[!apply(counts<5,1,all,na.rm=TRUE),]


2 - Next, data normalization factors were calculated that are incorporated in further analyses by the weighted trimmed mean of M-values (”TMM”) method.

list <- DGEList(counts=counts,group=group)
list <- calcNormFactors(list,method = "TMM")


3 - After fitting of negative binominal models and both common, tagwise and trended dispersion estimates were obtained, differentially expressed transcripts were determined using a generalized linear model (GLM) likelihood ratio test.

design <- model.matrix(~ 0 + group)
list <- estimateDisp(list,design,robust = TRUE)
fit <- glmQLFit(list,design,robust = TRUE)
single.contrast <- makeContrasts(Tumor  - HC, levels = design)
test <- glmQLFTest(fit,contrast = single.contrast)


4 - To minimize the effect of batch effect, resulting in possibly false positive results, ‘sequencing batch’ was included as a covariate in the GLM likelihood ratio test design matrix. DONT KNOW HOW TO DO

5 - To determine differentially expressed transcripts, we only focused on transcripts with expression levels of logarithmic counts per million (LogCPM) > 3.

tags <- topTags(test, n=nrow(test$table))$table
tags.logcpm <- tagsA[tags\$logCPM>3,]


At this point they obtain 5003 genes , I do obtain 4986 instead. But the logCPM values come out the same , just differences between FDR values (not so much tho).

Important Question : They do mention another thing that is the following -> To determine the specific input gene lists for the classifying algorithms we performed ANOVA testing for differences (as implemented in the R-package edgeR), yielding classifier-specific gene lists. Here they instead retrive 1072 genes that are actually inside my 4986 but I dont really understand what is this ANOVA testing for differences , and on what they did this (step 1? or after step 5?). This ANOVA testing is made to obtain genes for SVM classification so I assume they wanted genes with major differences.

geo expression gene r anova • 717 views
0
Entering edit mode

Kevin Blighe please review this post.

1
Entering edit mode

I am in the lengthy process of retiring from the field; so, my time for Biostars has now decreased to almost nil.

0
Entering edit mode

okay, thank you, Kevin Blighe I can understand. If possible, would you please recommend someone who can review that post for us? If anybody else could help with this then that would be great for us.

1
Entering edit mode

My good colleague ATpoint seems to have responded.

1
Entering edit mode

thank you so much

1
Entering edit mode
9 weeks ago
ATpoint 65k

If your results are highly similar and the main message is preserved then it's fine. Perfect reproducibility to the post-decimals is almost impossible unless you have identital input data, exact same software on and dependencies and the exact code they used. Please see the edgeR manual for a complete guide on the quasi likelihood pipeline. They additionally do filterByExpr but you seem to do something similar. Manual covers batch effects as well. In case you can fit it then it is just ~0+batch+group.

https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

0
Entering edit mode

thank you so much ATpoint you guys are gem for us on biostars