Question

RNAseq meta-analysis to identify “consistently expressed” genes

5 weeks ago

cgibbsm ▴ 20

Hi all,

I am performing an RNAseq meta-analysis, using multiple publicly available RNAseq datasets from NCBI (same species, different conditions).

My goal is to identify genes that are expressed - at least moderately - in all conditions.

Current Approach:

Normalisation: I've normalised the raw gene counts to Transcripts Per Million (TPM) to account for sequencing depth and gene length differences across samples.
Expression Thresholding: For each sample, I calculated the lower quartile of TPM values. A gene is considered "expressed" in a sample if its TPM exceeds this threshold.
Consistent Expression Criteria: Genes that are expressed (as defined above) in every sample across all datasets are classified as "consistently expressed."
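The three steps above can be sketched roughly as follows. This is only a minimal illustration in Python/NumPy, not the code actually used for the analysis: the function names, the toy counts and the gene lengths are all made up, and it assumes one genes x samples count matrix with known gene lengths.

```python
import numpy as np

def counts_to_tpm(counts, lengths_bp):
    """Convert a genes x samples count matrix to TPM (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1000.0
    rate = counts / lengths_kb[:, None]        # reads per kilobase, per gene
    return rate / rate.sum(axis=0) * 1e6       # each sample column sums to 1e6

def consistently_expressed(tpm):
    """True for genes whose TPM exceeds the sample's lower quartile in every sample."""
    q1 = np.percentile(tpm, 25, axis=0)        # one lower-quartile threshold per sample
    expressed = tpm > q1                       # compared column-wise via broadcasting
    return expressed.all(axis=1)

# Toy data (invented): 6 genes x 3 samples
counts = np.array([
    [500, 450, 520],
    [  0,   3,   0],
    [ 80, 120,  60],
    [ 10,   0, 200],
    [300, 310, 290],
    [  5,  40,   2],
])
lengths_bp = np.array([1000, 1500, 800, 1200, 2000, 1000])

tpm = counts_to_tpm(counts, lengths_bp)
mask = consistently_expressed(tpm)
print(mask)
```

Note that with a per-sample quartile cutoff, roughly a quarter of genes fail in every sample by construction, regardless of how deeply the sample was sequenced.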

Key Points:

I'm not interested in differential expression analysis, as most of the datasets I'm using lack appropriate control conditions.
I'm also not focusing on identifying "stably expressed" genes based on variance statistics (e.g. identification of housekeeping genes).
Most RNA-seq meta-analysis methods that I've read about so far rely on differential expression or variance-based approaches, which don't align with my needs.
There seems to be a lack of standardised methods for identifying consistently expressed genes without differential analysis. OR maybe I am over complicating it??

My questions:

Can anyone tell me if my current approach is appropriate/robust/publishable?
Are there other established methods or best practices for identifying consistently expressed genes across multiple RNA-seq datasets without relying on differential expression or variance analysis?

Any advice hugely appreciated, TIA

meta-analysis method rnaseq • 878 views

ADD COMMENT • link updated 4 weeks ago by mbyvcm ▴ 480 • written 5 weeks ago by cgibbsm ▴ 20

There seems to be a lack of standardised methods for identifying consistently expressed genes without differential analysis. OR maybe I am over complicating it??

Yes, there is probably none, because the definition of "expressed" itself is neither standardized nor robust. Slight changes in the cutoff can have a considerable influence, and the cutoff is entirely arbitrary. On top of that, detection in RNA-seq reflects a combination of true gene expression and the library prep kit: some procedures might favor certain genes over others, so detection levels could be entirely technical. Maybe you just plot gene vs "detected in n samples" and define a cutoff there. Or, if you really want something meta-analysis style, you could define some sort of Z-score, like the relative deviation of the TPM from the TPM cutoff: positive values mean higher than the cutoff, negative values mean below it. That gives a ranking for every dataset from "best detected" to "worst detected" gene, and you could aggregate these lists into a meta-ranking using RobustRankAggreg. Not saying that this is good or reliable, but it's at least a more formal method than any "gene must be above cutoff in x out of y samples". This returns a score per gene, and you can plot the score distribution and select a cutoff, e.g. with some sort of inflection-point-based method. Again, that cutoff is arbitrary, so essentially it's just kicking the can down the road, as arbitrary thresholds were already the problem in the per-dataset definition of "expressed genes".
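To make the two ideas concrete, here is a rough sketch in Python/NumPy. Everything in it is an assumption for illustration: the cutoff of 5 TPM, the pseudocount, the toy matrices, and the use of a log2 ratio as the "relative deviation". The aggregation step is a plain mean of ranks, a much simpler stand-in for what the RobustRankAggreg R package does, not a reimplementation of it.

```python
import numpy as np

def detection_counts(tpm, cutoff):
    """Number of samples in which each gene's TPM exceeds the cutoff."""
    return (np.asarray(tpm) > cutoff).sum(axis=1)

def relative_deviation(tpm, cutoff, pseudocount=1.0):
    """log2 ratio of TPM to the cutoff: positive above the cutoff, negative below."""
    tpm = np.asarray(tpm, dtype=float)
    return np.log2((tpm + pseudocount) / (cutoff + pseudocount))

def mean_rank(scores_per_dataset):
    """Average each gene's rank across datasets (rank 1 = best detected)."""
    ranks = []
    for scores in scores_per_dataset:
        order = np.argsort(-scores)                  # highest score first
        r = np.empty(len(scores))
        r[order] = np.arange(1, len(scores) + 1)
        ranks.append(r)
    return np.mean(ranks, axis=0)

# Toy TPM matrices (invented): 4 genes, two datasets with 2 and 3 samples
cutoff = 5.0  # hypothetical TPM cutoff
dataset_a = np.array([[50, 60], [0.5, 2], [10, 12], [3, 0]])
dataset_b = np.array([[40, 45, 55], [1, 0, 0.2], [8, 20, 15], [30, 25, 2]])

# Idea 1: gene vs "detected in n samples"
n_detected = detection_counts(np.hstack([dataset_a, dataset_b]), cutoff)

# Idea 2: one score per gene per dataset (median over samples), then aggregate ranks
scores = [np.median(relative_deviation(d, cutoff), axis=1)
          for d in (dataset_a, dataset_b)]
meta_rank = mean_rank(scores)

print(n_detected)
print(meta_rank)
```

A histogram of `n_detected`, or a sorted plot of `meta_rank`, is then where a cutoff would have to be picked by eye or by some inflection-point heuristic.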

Don't move this to answer, it doesn't qualify for it.

ADD REPLY • link 5 weeks ago by ATpoint 88k

Thanks for your reply - I can see your point about my definition of expression being inaccurate. That leads me to reframe my objective: instead, identify the genes detected in all samples.

You're right that the LQ threshold is entirely arbitrary. It was just a placeholder for a better idea, my attempt at excluding non-expressed/lowly expressed genes. For context, I am trying to identify a gene (and enzyme) unique to a single species of bacteria. I know it's expressed in all conditions studied, so I am analysing the available RNAseq datasets in order to exclude genes which are either non-expressed or only expressed in response to environmental conditions.

I had not formally addressed background/shot noise in my analysis (forgive me, I'm new to this). So I will start there, then move on to the gene vs "detected in n samples" plot and see how it shapes up.

Thanks again for your input.

ADD REPLY • link 5 weeks ago by cgibbsm ▴ 20

Just adding on, the problem really is that you cannot say a gene is not expressed just because it's not detected. I think your approach sounds acceptable, but I would still suggest performing TMM or RLE library size normalization if you plan to use a common TPM threshold. I think this would help because a strongly expressed outlier gene can distort a plain total-count library size, whereas the fancier library size scaling methods are less affected by it.
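As a rough illustration of why that matters, here is a minimal NumPy sketch of RLE-style (median-of-ratios) size factors on invented counts. It is a simplified version written for this post, not the DESeq2 or edgeR code; in practice you would use DESeq2's `estimateSizeFactors` or edgeR's `calcNormFactors`. The function name and the toy matrix are made up.

```python
import numpy as np

def rle_size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    keep = (counts > 0).all(axis=1)                   # genes with no zero counts
    log_counts = np.log(counts[keep])
    log_geo_mean = log_counts.mean(axis=1)            # per-gene log geometric mean
    log_ratios = log_counts - log_geo_mean[:, None]   # sample vs pseudo-reference
    return np.exp(np.median(log_ratios, axis=0))      # median ratio per sample

# Toy counts (invented): sample 2 is sequenced twice as deep as sample 1,
# except for one outlier gene that is 200x higher.
counts = np.array([
    [100,  200],
    [ 50,  100],
    [ 30,   60],
    [ 10, 2000],   # outlier gene
    [ 80,  160],
])

size_factors = rle_size_factors(counts)
total_ratio = counts[:, 1].sum() / counts[:, 0].sum()

print(size_factors[1] / size_factors[0])   # depth ratio seen by median-of-ratios
print(total_ratio)                         # depth ratio seen by total counts
```

In this toy case the median-of-ratios estimate recovers the 2x depth difference, while the total-count ratio is pulled far above 2 by the single outlier gene.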

If you have data of housekeeping/reference genes that you know shouldn't change between conditions, I may even favor that path, since to me it's a more robust reference point. At least showing average TPM of a set of housekeeping/reference genes could be useful too.

ADD REPLY • link 5 weeks ago by rfran010 ★ 1.6k

score 0 · Answer 1 · 2025-05-27

4 weeks ago

mbyvcm ▴ 480

You may find the normalisation approach used in the Bgee project useful: https://bioconductor.org/packages/release/bioc/html/BgeeCall.html

ADD COMMENT • link 4 weeks ago by mbyvcm ▴ 480