Question

DESeq2 Normalization and Pre-Filtering

11 months ago

turcoa1 • 0

Hello,

I am currently using DESeq2 for differential gene expression analysis and am confused about prefiltering. First off, I know DESeq2 has independent filtering, but this is only applied when using the results() function. Does this mean that calling counts(dds, normalized=TRUE) will give me counts for all the genes and not just the filtered genes? How can I get the counts for ONLY THE FILTERED GENES that DESeq2 keeps after independent filtering?

Also, is it better practice to filter prior to normalizing, or to normalize first and then filter the normalized counts?

Thanks

DESeq2 RNA-Seq Differential-Gene-Expression • 1.9k views

ADD COMMENT • link updated 11 months ago by LauferVA 4.2k • written 11 months ago by turcoa1 • 0

score 2 · Answer 1 · 2023-06-16

11 months ago

ATpoint 82k

I personally like to do prefiltering upfront using the filterByExpr() function from edgeR. It is automated, simple, and has stood the test of time. It is also aware of the experimental design, so it automatically respects sample and group sizes.

Prefiltering mainly removes genes with spurious counts and unreliable outliers, and it improves normalization and dispersion estimation. If necessary, you can always take the size factors estimated from this filtered dataset and feed them back to the unfiltered dataset to get normalized counts for all genes.
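A minimal sketch of that idea in R, assuming simulated counts from makeExampleDESeqDataSet() in place of real data. The object names (dds_all, dds_filt, keep) are made up for illustration, and the filterByExpr() defaults are used unchanged:

```r
library(DESeq2)
library(edgeR)

# Simulated counts standing in for a real experiment (8 samples, 2 conditions).
set.seed(1)
dds_all <- makeExampleDESeqDataSet(n = 2000, m = 8)

# Design-aware prefilter: logical vector, TRUE for genes with enough counts
# in enough samples given the group sizes.
keep <- filterByExpr(counts(dds_all), group = dds_all$condition)
dds_filt <- dds_all[keep, ]

# Estimate size factors on the filtered genes only.
dds_filt <- estimateSizeFactors(dds_filt)

# Feed the same size factors back to the unfiltered object.
sizeFactors(dds_all) <- sizeFactors(dds_filt)

norm_filt <- counts(dds_filt, normalized = TRUE)  # filtered genes only
norm_all  <- counts(dds_all,  normalized = TRUE)  # all genes, same scaling
```

Because both objects share one set of size factors, the rows of norm_all that passed the filter are the same values as norm_filt.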

Does that make sense?

ADD COMMENT • link 11 months ago by ATpoint 82k

I completely agree this is a valid approach. Also, neither of us mentioned that you can generate results using results(Dds) and then access the matrix of Cook's distances early on as well, i.e.:

P0 --> pre-filter raw counts --> P1 --> P2 valid
P0 --> make res object --> parse Cook's distances before or after pre-filtering

Most often these two orders produce approximately the same results, provided that the filters themselves are equally stringent in both cases. ATpoint, I would love to hear if your experience is different. I'll also readily admit that your approach is more theoretically sound in terms of processor wall-time. But even for datasets of up to N=1000, either order is quick.
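For reference, a hedged sketch of how these pieces can be pulled out in R, again on simulated data. In DESeq2 the Cook's distances are stored as an assay on the object returned by DESeq(). Genes dropped by independent filtering get an adjusted p-value of NA in results(), as do genes flagged as Cook's outliers or with all-zero counts. Using !is.na(padj) as the "passed" set is therefore an approximation, and the names below are illustrative only:

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 8)
dds <- DESeq(dds)
res <- results(dds)  # independent filtering is applied here by default

# Cook's distances: one value per gene and sample, stored on the fitted object.
cooks <- assays(dds)[["cooks"]]

# Genes that kept an adjusted p-value, i.e. were not removed by independent
# filtering, outlier flagging, or for having only zero counts.
passed <- !is.na(res$padj)

# Normalized counts restricted to those genes.
norm_passed <- counts(dds, normalized = TRUE)[passed, , drop = FALSE]
```

Calling counts(dds, normalized = TRUE) on its own still returns every gene in dds; the subsetting step is what restricts it.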

ADD REPLY • link 11 months ago by LauferVA 4.2k

score 1 · Answer 2 · 2023-06-15

----- Pre-analytical workflow -----

P0. Format raw objects to fit into a Dds object, which is a correctly formatted count matrix bound together with metadata, referred to as the colData(). Your design can be ~1.

P1. Normalize

P2. After normalization, mark bad variants for exclusion

P3. After filtration, variance stabilization (by convention, dds object now called vst or rld)

P4. Using the vst or rld object, generate Distance Matrices

P5. PCA on the vst object

P6. Based on visual inspection of P4 and P5, mark bad samples, which should be observable as excessively distant samples, even from others of the same time.
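The pre-analytical steps above could look roughly like this in R. This is only a sketch on simulated data: the gene-level cutoff in P2 (at least 10 normalized counts in at least 3 samples) is a made-up threshold, and marking bad samples in P6 remains a manual judgment call.

```r
library(DESeq2)

set.seed(1)
# P0: count matrix plus colData, with an intercept-only design (~ 1).
# Simulated data is used here in place of real counts and metadata.
toy <- makeExampleDESeqDataSet(n = 5000, m = 8)
Dds <- DESeqDataSetFromMatrix(countData = counts(toy),
                              colData   = colData(toy),
                              design    = ~ 1)

# P1: normalize (estimate size factors).
Dds <- estimateSizeFactors(Dds)

# P2: mark genes for exclusion based on normalized counts (hypothetical cutoff).
bad_genes <- rowSums(counts(Dds, normalized = TRUE) >= 10) < 3

# P3: variance stabilization on the remaining genes.
vsd <- vst(Dds[!bad_genes, ], blind = TRUE)

# P4: sample-to-sample distance matrix.
sample_dist <- dist(t(assay(vsd)))

# P5: PCA coordinates; drop returnData = TRUE to get the plot instead.
pca_df <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)

# P6: inspect sample_dist and pca_df by eye, then note any outlying samples.
```

Note that vst() fits its trend on a subset of genes, so very small filtered matrices may need varianceStabilizingTransformation() instead.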

----- Analytical workflow -----

A0. As before, once again load the RAW metadata and count data into Dds

A1. Exclude the bad variants identified in P2 from the count matrix of Dds, and name it TrimDds.

A2. Remove outlying samples identified in P6 from TrimDds, and name it SlimTrimDds.

A3. Correctly format your metadata (i.e., colData(SlimTrimDds) ). This involves removal of NA values in any columns you intend to use for modeling, and correct assignment of variables to factors, ordered factors, or numerics.

A4. Specify a biologically meaningful and mathematically cogent design formula, which we will call Model1. Doing this well presupposes extensive knowledge about the biology you are trying to model AND the purpose(s) of steps P2-P. Few biologists do the former, and few bioinformaticians master the latter, and finding both of those in one person is harder still.

A5. Fit that Model statement onto SlimTrimDds with design(SlimTrimDds)<-Model1

A6. Run Mod1Dds<-DESeq(SlimTrimDds), then run Mod1Res<-results(Mod1Dds)

A7. - AN. Any and all downstream, post-analytical workflows.
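A rough R sketch of steps A0 to A6, using the object names from the list above. The data is simulated, and the bad_genes and bad_samples vectors are hypothetical stand-ins for whatever the pre-analytical steps flagged. The design ~ condition is just the one variable the toy data happens to have.

```r
library(DESeq2)

set.seed(1)
# A0: load raw counts and metadata into Dds (simulated here).
toy <- makeExampleDESeqDataSet(n = 2000, m = 8)
Dds <- DESeqDataSetFromMatrix(countData = counts(toy),
                              colData   = colData(toy),
                              design    = ~ 1)

# Hypothetical exclusion lists carried over from the pre-analytical workflow.
bad_genes   <- rownames(Dds)[rowSums(counts(Dds)) < 10]
bad_samples <- character(0)  # e.g. c("sample3") if one had been flagged

# A1: drop the flagged genes.
TrimDds <- Dds[!rownames(Dds) %in% bad_genes, ]

# A2: drop the flagged samples.
SlimTrimDds <- TrimDds[, !colnames(TrimDds) %in% bad_samples]

# A3: clean the metadata: no NA in modeled columns, correct variable types.
SlimTrimDds <- SlimTrimDds[, !is.na(SlimTrimDds$condition)]
SlimTrimDds$condition <- droplevels(factor(SlimTrimDds$condition))

# A4-A5: specify the design formula and attach it.
Model1 <- ~ condition
design(SlimTrimDds) <- Model1

# A6: fit the model and extract results.
Mod1Dds <- DESeq(SlimTrimDds)
Mod1Res <- results(Mod1Dds)
```

Starting again from the raw counts in A0, rather than reusing the intercept-only object, means size factors and dispersions are estimated on the trimmed data under the final design.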