Differences between published differential gene expression results and own analysis on RNA-seq data
5 hours ago

Does anyone else experience their results of differential gene expression analysis being vastly different from what has been published? I am still very new to R programming and bioinformatics, and am just trying to find differentially expressed genes between platinum resistant and sensitive samples in the TCGA ovarian cancer dataset.

When I try to run either limma or DESeq2, I cannot seem to replicate the results that have been published by multiple papers, even when I use the same datasets and try to follow their methodology, and their code doesn't seem to be publicly available. Objectively, I would trust published results over my own analysis, but when I run individual t-tests on the genes of interest, the results tend to lie closer to my own analysis than to what has been published.

Anyone else facing the same issue, or have any possible insights? Any help will be greatly appreciated.

R TCGA LIMMA
4 hours ago

When you say vastly different, what do you mean exactly? Does your top 100 up/down not match theirs at all?
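One concrete way to answer that question is to quantify the agreement between the two top-N lists. A minimal Python sketch, with made-up gene names standing in for the real top-100 lists:

```python
# Sketch: quantify agreement between your top-N DE genes and the paper's.
# The gene lists here are invented for illustration; substitute your own.
mine = ["TP53", "BRCA1", "MYC", "EGFR", "KRAS"]
theirs = ["TP53", "PTEN", "MYC", "ERBB2", "KRAS"]

overlap = set(mine) & set(theirs)
jaccard = len(overlap) / len(set(mine) | set(theirs))

print(sorted(overlap))    # genes both analyses agree on
print(round(jaccard, 2))  # Jaccard index: 1.0 = identical lists, 0.0 = disjoint
```

A near-zero overlap points at a pipeline-level error; a partial overlap points at parameter or version differences.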

2 hours ago
Ales ▴ 50

I second the comment about comparing the top 100 genes - if you see vastly different results, I would lean towards examining the protocols themselves and verifying there are no hidden errors.

If the top 100 show some similarity, the differences might be due to a number of reasons:

  1. Software. Take, for example, pseudoalignment methods (Salmon/Kallisto) vs spliced alignment (HISAT2/STAR). The former will bias the results more towards what is in your known/reference transcriptome and will sometimes force-allocate extra reads to known genes (e.g. multimapping reads from a homologous gene not included in your reference).
  2. A subset of the previous point - parameters matter a lot and, alas, are sometimes poorly documented. As you noted, if there are no scripts available to supplement the analysis, your guess at which versions and arguments were used might not be entirely accurate. For example, in Salmon - did they augment the index with decoy sequences or not? This single difference alone could amount to a considerable difference in the results.
  3. Preprocessing of the data - did you perform the same filtering of the data before running the analysis as the authors did?
  4. Reference gene annotations can differ very (and I mean very) significantly, not just from source to source (RefSeq vs Gencode) but from version to version (especially true for Ensembl and Gencode). If you are working with anything other than human/mouse/Drosophila/Arabidopsis, the annotations are probably even more divergent, and using a version/source different from the original study will likely yield vastly different results. Tangentially, is the same version of the genome assembly being used? Many human studies (especially clinical ones) still rely on the outdated GRCh37 reference genome assembly.
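To illustrate point 3 above: even a small change in the low-expression filter changes which genes enter the model at all, and therefore which genes can appear as DE. A minimal sketch with invented counts (real pipelines typically use edgeR's filterByExpr or DESeq2's independent filtering rather than a hard cutoff like this):

```python
# Sketch: how a low-expression filter changes the tested gene set.
# Counts are invented for illustration.
counts = {
    "geneA": [0, 1, 0, 2],      # barely expressed
    "geneB": [55, 60, 48, 70],  # well expressed
    "geneC": [5, 9, 12, 4],     # borderline: kept or dropped by the threshold
}

def keep(gene_counts, min_count=10, min_samples=2):
    # Keep genes with >= min_count reads in >= min_samples samples.
    return sum(c >= min_count for c in gene_counts) >= min_samples

tested_strict = [g for g, c in counts.items() if keep(c)]
tested_lenient = [g for g, c in counts.items() if keep(c, min_count=5)]
print(tested_strict)   # only the clearly expressed gene survives
print(tested_lenient)  # the borderline gene is now tested too
```

If the authors filtered differently, borderline genes like "geneC" can appear in one DE list and simply be absent from the other.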

There are many other reasons others might offer below, but these are the first few I would look into (especially genome annotations, since they often go overlooked under the assumption that it's a solved problem). There's a reason so many papers get published comparing different methods for the same computational task - oftentimes there are considerable differences at every stage of the analysis, and many go overlooked.


Honestly, those things shouldn't result in a considerable discrepancy.

The largest difference will come from the post-processing & analysis (normalization type, limma vs. DESeq2, batch correction methodology, etc.).

Just from anecdotal experience.

For the OP: start by making plots (plot your logFC and p-values against what the paper reports, make Venn diagram overlaps of DE genes, etc.).
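Before plotting, a single number can already tell you a lot: the rank correlation of your logFC estimates against the paper's for shared genes. A stdlib-only Python sketch with invented values (in practice you'd read both DE tables and join on gene ID):

```python
# Sketch: rank-correlate your logFC estimates with the paper's.
# Values are invented for illustration.
mine = {"TP53": 1.8, "MYC": -0.9, "KRAS": 0.4, "PTEN": -1.2}
paper = {"TP53": 1.5, "MYC": -0.7, "KRAS": 0.1, "PTEN": -1.0}

shared = sorted(set(mine) & set(paper))
x = [mine[g] for g in shared]
y = [paper[g] for g in shared]

def spearman(a, b):
    # Spearman rho = Pearson correlation of the ranks (no tie handling here).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    sa = sum((p - ma) ** 2 for p in ra) ** 0.5
    sb = sum((q - mb) ** 2 for q in rb) ** 0.5
    return cov / (sa * sb)

rho = spearman(x, y)
print(round(rho, 2))  # 1.0 here: identical gene ordering despite shifted magnitudes
```

A high rho with shifted magnitudes suggests a normalization or shrinkage difference; a low rho suggests something more fundamental (annotation, filtering, sample labels).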
