Question

What should be done with newly found DEGs in over-representation analysis?

16 months ago

ghs101 • 0

Hi,

I have about 20 newly found differentially expressed genes in my dataset (460 DEGs in total) that have no Ensembl ID. Should I leave them in or remove them from the downstream analysis? What is the correct procedure for over-representation analysis in this case?

GSE analysis enrichment over-representation • 488 views

updated 16 months ago by ATpoint 82k • written 16 months ago by ghs101 • 0

score 2 · Answer 1 · 2022-12-27

In my opinion, the "correct" way to do any enrichment/over-representation analysis (for example against REACTOME terms) starts with defining the "universe" or "background" correctly.

The test set is all DEGs, filtered for genes that have an annotation in the database.
The background/universe is all genes eligible for DEG analysis, again filtered for genes that have an annotation in the database. With DESeq2 that would be the genes surviving the independent filtering (i.e. not having NA in the padj column); with edgeR it would be the genes retained after applying filterByExpr.
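The two definitions above boil down to set intersections. Here is a minimal sketch in plain Python, just to make the logic explicit; the gene IDs and the helper name `build_ora_inputs` are made up for illustration and are not part of any package.

```python
def build_ora_inputs(degs, tested_genes, annotated_genes):
    """Return (test_set, universe), both restricted to annotated genes.

    degs            -- genes called differentially expressed
    tested_genes    -- genes eligible for DEG testing (e.g. passed expression filtering)
    annotated_genes -- genes with at least one term in the pathway database
    """
    annotated = set(annotated_genes)
    # Universe: eligible for testing AND annotated in the database.
    universe = set(tested_genes) & annotated
    # Test set: DEGs that are also part of the universe.
    test_set = set(degs) & universe
    return test_set, universe


# Hypothetical toy data: NOVEL1 is a DEG without any annotation,
# G9 is annotated but was never eligible for testing.
tested = {"G1", "G2", "G3", "G4", "G5", "NOVEL1"}
degs = {"G1", "G2", "NOVEL1"}
annotated = {"G1", "G2", "G3", "G4", "G9"}

test_set, universe = build_ora_inputs(degs, tested, annotated)
print(sorted(test_set))
print(sorted(universe))
```

The unannotated DEG drops out of the test set and the untested but annotated gene drops out of the universe, so both sets describe the same population.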

Functions like enricher() from clusterProfiler support such an analysis. Setting an appropriate background is critical to obtaining meaningful statistics. It obviously makes a difference whether you test against, for example, 8,000 genes that meet the criteria for your "universe" (annotated in a database such as REACTOME and eligible in your analysis, i.e. expressed) or against all ~50,000 annotated genes regardless of expression status. The latter gives much more lenient, inflated statistics and therefore a lot of false positives.
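To see how much the background matters, here is a rough sketch of the one-sided hypergeometric test that over-representation analysis is typically based on, written in plain Python. All counts (440 annotated DEGs, a pathway of 100 or 150 genes, 12 overlapping genes) are invented for illustration, and the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n genes from a universe of N that holds K pathway genes."""
    denominator = comb(N, n)
    # Sum the probabilities of every overlap from k up to the largest possible one.
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return numerator / denominator


n_degs = 440   # assumed: annotated DEGs in the test set
overlap = 12   # assumed: DEGs that belong to the pathway

# Background restricted to expressed, annotated genes (assumed pathway size: 100).
p_expressed = hypergeom_upper_tail(overlap, N=8_000, K=100, n=n_degs)

# Background of all annotated genes regardless of expression (assumed pathway size: 150).
p_all = hypergeom_upper_tail(overlap, N=50_000, K=150, n=n_degs)

print(f"expressed universe: p = {p_expressed:.3g}")
print(f"all-genes universe: p = {p_all:.3g}")
```

With the larger background the same overlap looks far more surprising, because the expected overlap by chance shrinks from about 5.5 genes to about 1.3. That is the inflation described above.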

That said, a gene without any annotation cannot contribute to enrichment results for known pathways anyway, so I'd remove it.