Question

Differential expression analysis in a short gene list

3 months ago

yixinzeng • 0

Hello, I've recently encountered a problem. I have a short gene set of interest, consisting of about a dozen genes. I'm interested in exploring the differential gene expression within this gene set between two types of samples. Here are my questions:

1. Should I perform a full differential expression analysis (for example, using DESeq2) and then filter the results based on the gene set? Or should I first filter the target genes from the count matrices and then proceed with the statistical analysis? (I would guess that testing only a small subset of genes would yield better results than testing the entire transcriptome.)
2. If it's the latter, what statistical method should I use? Are workflows like DESeq2 still applicable?

differential-expression DESeq2 RNA-seq • 510 views

ADD COMMENT • link 3 months ago by yixinzeng • 0

score 4 · Accepted Answer · 2024-01-08

3 months ago

i.sudbery 19k

In these sorts of cases, I recommend running DESeq2 on the full transcriptome, as it will enable better normalisation and better sharing of information between genes in the dispersion estimates.

Once the analysis has been run on the full transcriptome, I would then subset to your gene list of interest and recalculate the FDRs to account for the smaller gene set:

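library(dplyr)
library(tibble)

# Keep only the genes of interest, then recompute BH-adjusted p-values over that subset
deseq_results <- as.data.frame(deseq_results) %>%
    rownames_to_column("gene_id") %>%
    filter(gene_id %in% list_of_ids) %>%
    mutate(padj = p.adjust(pvalue, method = "BH"))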

ADD COMMENT • link 3 months ago by i.sudbery 19k

Thank you for your detailed answer! But I'm still worried about multiple testing correction.

Considering the impact of type I error, if I first normalize, then subset my matrices and perform a simple statistical test like a t-test, this would involve far fewer statistical tests than running on the full transcriptome. Is this feasible, and would it be more effective than running on the full transcriptome?

In general, I am only interested in a small subset of all the genes. Is it really necessary to run the analysis on the full transcriptome? Would that introduce more errors?

ADD REPLY • link 3 months ago by yixinzeng • 0

If you're going to use a normalization such as FPKM or TPM and test for differences with a Wilcoxon test or t-test, I think it's OK to run on just your gene list of interest. But if you want to use DESeq2, you'd better run it on the whole transcriptome.

DESeq2 normalizes with a median-of-ratios approach: for each sample, the size factor is the median, across genes, of the ratio of each gene's count to its geometric mean across samples. This assumes that the abundance of most genes is similar among all samples, so the gene at the median behaves like a housekeeping gene. If you run it on only the gene set of interest, this assumption may not hold.
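
To make the normalization point concrete, here is a minimal base R sketch of the median-of-ratios idea. It is a simplified illustration rather than DESeq2's actual implementation; the count matrix, the gene and sample names, and the helper `size_factors` are all made up for this example, and it assumes there are no zero counts.

```r
# Toy count matrix (made-up numbers): rows are genes, columns are samples.
# sampleB is sequenced twice as deeply; only gene4 changes beyond that.
counts <- matrix(
    c(100, 200,
       50, 100,
       30,  60,
       10, 100),
    ncol = 2, byrow = TRUE,
    dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB"))
)

# Median-of-ratios size factors: take each gene's geometric mean across
# samples, then the per-sample median of count / geometric mean
# (computed on the log scale).
size_factors <- function(m) {
    log_geo_means <- rowMeans(log(m))
    apply(m, 2, function(col) exp(median(log(col) - log_geo_means)))
}

# Size factors from all genes versus from the gene of interest alone.
sf_full <- size_factors(counts)
sf_subset <- size_factors(counts[4, , drop = FALSE])

# Normalized counts for gene4 under each set of size factors.
gene4_full <- counts["gene4", ] / sf_full
gene4_subset <- counts["gene4", ] / sf_subset
```

With all four genes, the size factors differ only by the 2x depth difference, so gene4 still shows a 5-fold change after normalization. With gene4 alone, the size factors absorb the whole 10-fold difference and the gene looks unchanged, which is the failure mode described above.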

ADD REPLY • link 3 months ago by zau saa ▴ 120

Thank you very much!

ADD REPLY • link 3 months ago by yixinzeng • 0

No, the full transcriptome gives you much more power to do the normalisation and dispersion estimation accurately. Also, a t-test is not valid on count data.

But you shouldn't be worried about multiple testing correction. The idea of the code above is that if you have a gene list of 100 genes, you run the full 20,000 tests, but because you only ever look at 100 of them, you adjust for multiple testing as if you had only ever done 100 tests. This is valid because you throw the other 19,900 tests out without looking at them.
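
As a rough illustration of the 100-versus-20,000 point, the sketch below compares BH adjustment over all tests with BH adjustment over only the genes of interest, using base R's `p.adjust`. All p-values here are invented for the example (three small ones plus evenly spread values standing in for null genes), and it assumes the gene list was fixed before looking at any results.

```r
# Hypothetical p-values for the 100 genes of interest: three small values,
# the rest spread evenly between 0.05 and 1.
p_interest <- c(0.0004, 0.003, 0.02, seq(0.05, 1, length.out = 97))

# Hypothetical p-values for the other 19,900 genes, spread evenly over
# (0, 1] as a stand-in for genes with no real effect.
p_other <- seq_len(19900) / 19900

# Option 1: adjust over all 20,000 tests, then look up the genes of interest.
padj_full <- p.adjust(c(p_interest, p_other), method = "BH")[1:100]

# Option 2: adjust over the 100 genes of interest only, as in the answer's code.
padj_subset <- p.adjust(p_interest, method = "BH")

# Compare the three smallest genes of interest under both schemes.
comparison <- data.frame(
    pvalue      = p_interest[1:3],
    padj_full   = padj_full[1:3],
    padj_subset = padj_subset[1:3]
)
print(comparison)
```

In this toy setup the gene with p = 0.0004 gets an adjusted p-value of 0.04 when only the 100 genes are considered, while adjusting over all 20,000 tests pushes it far above any usual cutoff. The raw p-values are identical in both cases; only the number of tests being corrected for changes.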