Question: "Studying R" Why don't I find the same results of the articles?
0
gravatar for Leite
2.1 years ago by
Leite930
São Paulo - Brazil - Unifesp
Leite930 wrote:

Hello everyone,

I'm studying R and I like to repeat published works, like this: Bioinformatics analysis of gene expression profile data to screen key genes involved in pulmonary sarcoidosis.

but in these protein-protein interactions works they do not give much informations abous how was analyzed, they say only the genes with the | log2fold change (FC) | > 1 and FDR <0.05 were considered the SDRs.

This paper used this GSE16538 from Crouser ED, Culver DA, Knox KS, Julian MW et al. Gene expression profiling identifies MMP-12 and ADAMDEC1 as potential pathogenic mediators of pulmonary sarcoidosis. Am J Respir Crit Care Med 2009 May 15;179(10):929-38

My questions are:

  1. These articles (PPI network) usually don't repeat the analyzes with the same parameters of the original paper, why?
  2. Because they don1t use the same parameters they find differentresults and this influences the networks, so I should or not repeat the parameters of the original paper?
  3. Now, in relation to my studies with R, even using the same cutoff parameters, I often ddon't find the same numbers of genes of the article, should I be doing something wrong? or does this happen even because of the few information about the R code used?

Best regards,

Leite

gene expression array genome • 950 views
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Leite930
3
gravatar for Kevin Blighe
2.1 years ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

Hi Leite,

I would not feel disappointed about not being capable of reproducing the results. The lack of reproducibility in research has long been known but is only now gaining focus:

We owe it to the very public and private institutions that fund us to do a better job in quality control of our results. My personal worry is that we permit 'error' (generally defined here) to enter our research far too easy, and this error then becomes propagated over years to become part of our very conclusions and beliefs of how disease mechanisms work, which has ultimately contributed to the lack of reproducibility that we now see. As a perfect example, NGS is being pushed into clinical practice right now but it has higher error rates than Sanger sequencing, and the analysis methods are neither ready for use in clinical practice.

Organisations like The Cochrane Collaboration (I'm a member of the Breast Cancer Group) have, I believe, helped to open the door to this revelation, i.e., that published research is non-reproducible.


  1. These articles (PPI network) usually don't repeat the analyzes with the same parameters of the original paper, why?
  2. Because they don1t use the same parameters they find differentresults and this influences the networks, so I should or not repeat the parameters of the original paper?

These would be questions best directed to the authors of the study. If one re-analyses data, though, one will typically have a different hypothesis than the original authors and, thus, may choose different parameters in order to produce results that best support this hypothesis. Network methods can be highly variable and slight alterations in, for example, fold change and FDR Q value can have a bg impact on the resulting network that's constructed.

I have to just draw your attention to the conclusion of the first manuscript that you cited:

CONCLUSION: CXCL9, CXCL11, STAT1, CCL5 and GBP1 might be implicated in pulmonary sarcoidosis through interacting with each other.

...note the word 'might'.

ADD COMMENTlink written 2.1 years ago by Kevin Blighe52k

Hey Kevin Blighe,

Thank you very much for your reply,

What worries me about the different results found is exactly the lack of reproducibility, because each one will use a specific code in the R with specific packages which are often not mentioned in the methods.

Between the example I quote the two articles use the same data and get different results (in numbers of genes), I reproduced the two tests and got different results (in number of genes). But one thing that might be equal between their findings are the functions and pathways of them can be generally the same.

As I am studying R, was this doubt, is "normal" this difference of results?

Original article method: log2fold change (FC) | > 1 and FDR <0.05 were considered the SDRs.

In this paper they found 208 genes, using my code I found 288.

My R code: #Install the core bioconductor packages, if not already installed

source("http://bioconductor.org/biocLite.R")

biocLite()

#Install additional bioconductor libraries, if not already installed
>biocLite("GEOquery")
>biocLite("affy")
>biocLite("gcrma")
>biocLite("affyPLM")
>biocLite("RColorBrewer")
>biocLite("limma")
>biocLite("hgu133plus2.db")
>biocLite("annotate")
>biocLite("simpleaffy")

#Load the necessary libraries
>library(GEOquery)
>library(affy)
>library(gcrma)
>library(affyPLM)
>library(RColorBrewer)
>library(hgu133plus2.db)
>library(annotate)
>library(simpleaffy)

#Downloading the file
>getGEOSuppFiles("**MyData**")

#Before starting to work, you have to unzip the files.
>untar("**MyData**/**MyData**_RAW.tar", exdir="data")
>cels <- list.files("data/", pattern = "[gz]")
>sapply(paste("data", cels, sep="/"), gunzip)
>cels

#Loading Files
#Phenodata must have names have to be the same as file names
>celfiles <- read.affy(covdesc="phenodata.txt", path="data")

#Normalising the data
>celfiles.rma <- rma(celfiles)

#Checking the quality
#Load colour libraries
>library(RColorBrewer)

#Set colour palette
>cols <- brewer.pal(8, "Set1")

#Plot a boxplot of unnormalised intensity values
>boxplot(celfiles, col=cols)

#Now a normalised boxplot
>boxplot(celfiles.rma, col=cols)

#Plot a density vs log intensity histogram for the unnormalised data
>hist(celfiles, col=cols)

#Plot a density vs log intensity histogram for the normalised data
>hist(celfiles.rma, col=cols)

#The first step of the analysis is to filter non-informative data
>celfiles.filtered <- nsFilter(celfiles.rma, require.entrez=FALSE, remove.dupEntrez=FALSE)

#Find differentially expressed probes
>samples <- celfiles.rma$Target

#Convert into factors
>samples <- as.factor(samples)

# Check factors have been assigned
>samples

#Set up the experimental design
>design <- model.matrix(~0 + samples)
>colnames(design) <- c("disease", "control")

#Inspect the experiment design
>design

#At this point we have normalized filtered data and a description of the data and the samples and the experimental design.
#Fit the linear model to the filtered expression set
>fit <- lmFit(exprs(celfiles.filtered$eset), design)

#Set up a contrast matrix to compare disease v control
>contrast.matrix <- makeContrasts(control-disease, levels=design)

#Now, the contrast matrix is combined with the linear fit model by set of probes.
>fit.con <- contrasts.fit(fit, contrast.matrix)
>fit.eb <- eBayes(fit.con)
>probeset.list <- topTable(fit.eb, coef=1, number=50000,  adjust.method="BH",sort.by="P", lfc=1)

#Annotating the results with associated gene symbols
>gene.symbols <- getSYMBOL(rownames(probeset.list), "hgu133plus2")
>results <- cbind(probeset.list, gene.symbols)
>head(results)

#To save results
> write.table(results, "results.txt", sep="\t", quote=FALSE)

PS: I manually filter the FDR <0.05 in the results table.

Best regards

Leite

ADD REPLYlink written 2.1 years ago by Leite930
1

Hi Leite, thanks - I moved your post.

I don't see anything incorrect in your code for limma, apart from the line contrast.matrix <- makeContrasts(control-disease, levels=design). The order you're comparing control and disease here will just mean that positive fold-changes will represent genes higher in controls, and negative fold-changes will represent genes higher in disease. You can easily switch this around, i.e., disease-control, in order to get the opposite. This will neither affect the overall number of statistically significant genes.


As I delve further into the methods, it becomes more clear why results differ from the initial study at least. Here are the methods used from the initial study by Crouser and colleagues:

Statistical analysis comparing the sarcoidosis and control groups was performed using BRB Array Tools software developed by the Biometric Research Branch of the U.S. National Cancer Institute (http://linus.nci.nih.gov/BRB-ArrayTools). Array data were excluded if less than 20% of expression data had at least a twofold change in either direction from the gene's median value; 6,533 probe sets passed the filtering criteria. To identify differentially expressed genes between the two groups, a modified t test with random variance model was applied. The random variance model is an improvement over the standard t test because it permits sharing information among genes about within-class variation without assuming that all genes have the same variance (17). To adjust for multiple test comparisons, the Benjamin and Hochberg false discovery rate, which controls the proportion of genes expected in a list of genes by chance, was applied. Fold change between the two groups was estimated by the ratio of geometric means of intensities in each group. By first applying a t-distribution to obtain 95% confidence intervals for the difference in means of logarithmic intensities and then using an exponential transformation of these intervals, 95% confidence intervals for fold changes were obtained. A gene was identified as differentially expressed when the P value was < 0.005, the false discovery rate was <0.03, and the fold increase/decrease was ≥2.

Gene Network Analysis

Functional and network analyses of gene expression data derived from the Affymetrix HG-U133 Plus 2 chips was performed using Ingenuity Systems' IPA software (Ingenuity Systems Inc., Redwood, CA), which is described in greater detail in the Discussion. Briefly, gene expression changes are considered in the context of physical, transcriptional, or enzymatic interactions of the gene/gene products and are grouped according to interacting gene networks through the use of Ingenuity Pathways Analysis. The score assigned to any given gene network takes into account the total number of molecules in the data set, the size of the network, and the number of “network eligible” genes/molecules in the data set. The network score is based on the hypergeometric distribution and is calculated with the right-tailed Fisher's exact test. The network score is the negative log of this P value.

So, they used completely different tests that have different assumptions than limma. They also do very specific filtering of genes prior to conducting the differential expression analysis. So, it's no surprise that results differ slighly from this study.


For the study by Li and colleagues, I can only see that they mention the use of limma in the abstract. The abstract, being just an abstract, doesn't give specific information on the exact methods. If you have access to the journal and can download the PDF, I encourage you to look to see exactly what the authors did when it came to their use of limma.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Kevin Blighe52k

Hey Kevin,

Thank you for your attention,

For the study by Li:

2.2 Data preprocessing and DEGs screening

The raw data were preprocessed by utilizing affy package in R (available at http://www.bioconductor.org/packages/release/bioc/html/affy.html) (Gautier et al., 2004). Briefly, the processes of data preprocessing included background correction, quantile normalization and probe summarization. For multiple probes mapped to one gene, their average value was considered as the gene expression value. After preprocessing, total 54674 probes corresponding to 43284 genes were obtained. Then, limma package (available at http://www.bioconductor.org/packages/release/bioc/html/limma.html) (Diboun et al., 2006) in R was used to identify the DEGs between pulmonary sarcoidosis samples and normal lung samples. The t-test method was utilized to calculate the p-values of genes. Then, Benjamini & Hochberg’s method (Benjamini and Hochberg, 1995) was used to calculate the adjusted p-values (that was false discovery rate, FDR). Only the genes with the |log2fold change (FC)| >1 and FDR < 0.05 were considered as DEGs.

This is the method he describes, compared to my code what's wrong?

When I use disease-control rather than control-disease, I find numbers equal to mine, but with inverted values when compared to those of work.

In paper Li said CCL5 is up reguled,

CCL5 = 1.455779 in Li paper

CCL5 = -1.433074703 with my R code control-disease

CCL5 = 1.433074703 with my R code disease-control

Now I'm very confuse.

Again, Thank you so much!

Leite

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Leite930
1

Hey Leite,

Looks like a very standard limma analysis that they did, but they mention a few things that are indeed vague. To start, I would have these questions for you:

  • Can you confirm that you also have 54674 probes?
  • Where do you summarise your probe values by the mean in order to get expression values over genes?

At the end of their methods, they just state that they conducted a t-test and then adjusted by BH. This for me directly implies that they literally just used the t.test() function in r and then adjusted with p.adjust(), and that they calculated fold-changes manually by dividing and then logging. Limma just fits a linear model to your data (akin to lm()) and the essentially it is indeed a t-test that's conducted to compare your conditions of interest. So, this is not clear what exactly they did, but one could assume that they did indeed just use limma and are merely explaining the limma method in brief in their methods

One key difference that I note is that you adjust your statistics by the empirical Bayes method, whereas they make no mention to this. That may contribute to the difference.

In general, you're probably now understanding better why research is so lacking in reproducibility, but this is a global problem that affects everyone at all levels.

ADD REPLYlink written 2.1 years ago by Kevin Blighe52k

Hey Kevin,

Can you confirm that you also have 54674 probes?

No, dont have 54674, only 54612 probes.

Where do you summarise your probe values by the mean in order to get expression values over genes?

It looks like this is missing from my code, How can I add this?

How can I add t.test () in my code, should I remove eBayes ()?

ADD REPLYlink written 2.1 years ago by Leite930

If I understood correctly, there is no correct and fixed way of doing analysis of genetic expression, each researcher does the way he thinks is correct.

ADD REPLYlink written 2.1 years ago by Leite930
2

That's pretty much how it goes, yes. Well, 'she' or 'he', in relation to researchers. There are many very skilled female and male bioinformaticians / computational biologists.

In relation to this thread, it looks like the array is a relatively modern Affymetrix array design, thus, you should be able to mean-summarise over genes with the rma() function, with something like:

#Summarise over genes
project.bgcorrect.norm.avg <- rma(project, background=TRUE, normalize=TRUE, target="core")
#Get probe-level values
project.bgcorrect.norm.avg.Exons <- rma(project, background=TRUE, normalize=TRUE, target="probeset")

Give that a try.


Also, yes, try with and without eBayes and see how it affects the results. The last time that I checked, it won't affect the most statistically significantly differentially expressed transcripts - only those at the frontier of statistical significance will be affected.

I would not use t.test() - I believe that they used the standard limma tests but that they just wanted to elaborate a bit further on how limma works.

The discrepancy in starting probe numbers may be due to control probes. Control probe IDs on the Affymetrix platforms generally start with 'AFFX' or 'affx', so, search for those in the celfiles.rma object that you create.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Kevin Blighe52k

Dear Kevin,

Thank you very much, your help was of great importance for my studies.

ADD REPLYlink written 2.1 years ago by Leite930
1

No problem my friend! Good luck

ADD REPLYlink written 2.1 years ago by Kevin Blighe52k

Hey Kevin,

I sent to you a email, I dont know if you has received it.

ADD REPLYlink written 2.1 years ago by Leite930

Olá, yes, I will get back to you shortly!

ADD REPLYlink written 2.1 years ago by Kevin Blighe52k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1915 users visited in the last hour