Question

Differential expression analysis - issue with replicating results

2

Entering edit mode

2.4 years ago

suzanne rein ▴ 10

Hi Biostars!

I'm fairly new to bioinformatics (self-taught for my MSc) and I'm trying to replicate a study using publicly available transcriptomic data (GSE107934). I'm struggling to get the same results as the authors. I'm following the study's methods and using DESeq2 to conduct the differential expression analysis. Most of the DEGs I find are the same as the authors, but the logFC and adjusted P-values are different. The order of DEGs ranked by logFC is also different, which makes me think I'm doing something wrong. I presume the differences may due to a few factors, and I'm wondering if anyone could please provide some insight.

My ideas:

For annotations, the authors used the GRCh37.p13 assembly - I'm having a hard time making sure I'm accessing this exact same assembly. This could be a reason for the discrepancies? I've tried using the "ensemble versions" available at the time of the study, but these results were still not the same.
The authors might have used a calculation(s) not made explicitly clear in their methods?
My design is incorrect? Although this seems to be the only model matrix R lets me create without getting a model matrix is not full rank error.

The study: https://journals.physiology.org/doi/full/10.1152/japplphysiol.00014.2018

The goal is to see the differentially expressed genes between exercise types [aerobic exercise (AE) and resistance exercise (RE)] and time points (baseline, 1hr post-exercise, and 4hrs post-exercise) for 6 participants. Please note that participant #4 is missing a single time point (AE_4).

ANY help at ALL would be incredibly appreciated.

Here is my code, starting from creating the groups.

treatment_dson <- c("Baseline", "AE_1", "AE_4", "RE_1","RE_4")
rep_treat_dson <- c(rep("Baseline",1), rep("AE_1",1), rep("AE_4",1), rep("RE_1",1),rep("RE_4",1),
                rep("Baseline",1), rep("AE_1",1), rep("AE_4",1), rep("RE_1",1),rep("RE_4",1),
                rep("Baseline",1), rep("AE_1",1), rep("AE_4",1), rep("RE_1",1),rep("RE_4",1),
                rep("Baseline",1), rep("AE_1",1), rep("RE_1",1),rep("RE_4",1),
                rep("Baseline",1), rep("AE_1",1), rep("AE_4",1), rep("RE_1",1),rep("RE_4",1),
                rep("Baseline",1), rep("AE_1",1), rep("AE_4",1), rep("RE_1",1),rep("RE_4",1))

factor_treat <- factor(rep_treat_dson)  

pheno_dson_clean$GROUP <- factor_treat
pheno_dson_clean$GROUP <- relevel(pheno_dson_clean$GROUP, ref="Baseline")

Creating the DESeq object:

dds_dson <- DESeqDataSetFromMatrix(countData = cts_dson,
                               colData = pheno_dson_clean,
                               design = ~ GROUP)
keep <- rowSums(counts(dds_dson)) > 1
dds <- dds_dson[keep,]

Differential expression analysis:

dds <- DESeq(dds)
res <- results(dds)

Differential expression analysis for AE, baseline to 4hrs post-exercise:

resultsNames(dds)
res.1 <- results(dds, name="GROUP_AE_4_vs_Baseline")
filter_df <- as.data.frame(res.1)

Function to filter results based on author's cut-offs:

my_results = function(DESeqResults_class){
  df = as.data.frame(DESeqResults_class)
filter_padj = df[df$padj < 0.05,]
filter_logFC = filter_padj[abs(filter_padj$log2FoldChange) > 0.58,]
clean_df = subset(filter_logFC, !is.na(baseMean))
r2c = rownames_to_column(clean_df, var = 'ensgene')
  return(r2c)
}

Function for annotations:

ensembl104 = useEnsembl(biomart="ensembl", version=104)      
ensembl104 = useDataset("hsapiens_gene_ensembl", 
                    mart=ensembl104)

my_annotations = function(r2c){
x = getBM(attributes=c('ensembl_gene_id',
                     'external_gene_name',
                     'ensembl_gene_id_version',
                     'ensembl_transcript_id',
                     'ensembl_transcript_id_version',
                     'start_position','end_position',
                     'strand','transcript_start',
                     'transcript_end',
                     'gene_biotype',
                     'go_id',
                     'transcript_length',
                     'transcript_version',
                     'description',
                     'chromosome_name'), 
        filters = c('ensembl_gene_id'), 
        values = filter_df$ensgene,
        mart = ensembl104)
  y = left_join(filter_df_R2C,x,by=c('ensgene'='ensembl_gene_id'))
  return(y)
}

x = my_annotations(filter_df_R2C)

Write results into csv file:

write_csv(x, file = "/Users/Documents/AE_b_vs_4.csv",
      append = FALSE)

Again, I would be eternally grateful for absolutely any advice... Thank you very much!

rnaseq DESeq2 R bioconductor • 1.3k views

ADD COMMENT • link updated 2.4 years ago by ATpoint 82k • written 2.4 years ago by suzanne rein ▴ 10

3

Entering edit mode

Did they provide the counts or are you generating them yourself? If the latter, that could be the issue.

Exact replication should not be necessary so long as the end conclusions are reproducible from the results you acquire. Any paper worth its weight will return the same conclusions when its data is analyzed by an orthogonal, high-quality method.

ADD REPLY • link 2.4 years ago by jared.andrews07 ★ 16k

2

Entering edit mode

I would even say that exact replication is close to impossible unless exact input data (count matrix), R and package versions, exact code and (if this applies) the random seeds are provided. Without that there will always be slight variations in outputs, and be it postdecimal values which then (when using hard cutoffs such as padj < 0.05) might throw a few genes above/below the threshold on one but not the other machine you are running it on, plus of course differences in tool defaults etc. Check whether the conclusions are the same, e.g. whether enriched pathways are identical, or use fGSEA to see whether their gene sets (=their DEGs) are highly significant in your data.

If you are creating the counts yourself then it's even more difficult because you would need the exact genome version, the exact annotation (e.g. NCBI vs GENCODE, there are thousands of genes included in one but not another, plus the version of the annotation, e.g. GENCODE v30/31/32...) plus exact same aligner/quantifier versions and code...you see...things do add up to be difficult for exact replication.

ADD REPLY • link 2.4 years ago by ATpoint 82k

2

Entering edit mode

DESeq2 had some changes in defaults that could cause sizeable changes in results (e.g., see New function lfcShrink() in DESeq2 or RNA seq analysis using Deseq2 and apeglm lfc shrinkage). You should try to get the same version (both R and BioConductor) as the ones used in the paper, in case they stated the versions used.

ADD REPLY • link 2.4 years ago by h.mon 35k

0

Entering edit mode

Thank you very much for your reply. This makes sense. In case they did not state the versions used (which, they did not), do you recommend estimating which version was used based on the year of publication? Or, would it be wiser to use updated packages and explain potential reasons for discrepancies? I sincerely appreciate your help.

ADD REPLY • link 2.4 years ago by suzanne rein ▴ 10

2

Entering edit mode

You could use publication year as an upper bound, but oftentimes there is an interval of a couple of years between finishing the analysis and publishing the manuscript - the versions of the tools and databases used could be a lot older than your estimate.

Why are you trying to replicate the analysis? Is it a learning experience? If so, yes, it me be useful to repeat the analysis with older versions of the tools and databases. You will learn how it is difficult to replicate published analyses, and, hopefully, you will adopt better practices (e.g., providing version numbers and parameters used).

If you want to extract further biological insight from this data, just make sure your analysis is correct, and do not bother with getting the exact same results as the original analysis.

ADD REPLY • link 2.4 years ago by h.mon 35k