Is it correct to pre-select a set of genes before performing differential expression analysis with DESeq2? For example, my first comparison would be tumoral vs non-tumoral tissue. I would then take the set of genes I get from it (over 10,000 DE genes) and use just that set, i.e. the genes differentially expressed in the tumoral vs non-tumoral comparison, to compare, for example, patients who recurred vs patients who did not recur.
There was a similar question recently, asking whether removing a large set of genes in order to save computational time is valid. Without being a statistician, and purely based on my (naive) understanding of DESeq2, I assumed that removing a large number of genes, or in your case subsetting to certain genes, might violate the assumptions of DESeq2. In your case, the question is whether the median ratio of the chosen genes still captures the true size relationships between the samples (e.g. sequencing depth), as this is the basis for the normalization process. In other words, do the chosen genes allow the different samples to be scaled appropriately relative to each other? Why don't you choose the patients of interest based on the first analysis, assign factor levels to them ("recurr" / "non-recurr"), rerun DESeq2 on the full set of genes, and then check whether your target genes come out as DE?
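To make the normalization concern concrete, here is a minimal sketch in plain Python (not DESeq2 itself, and not its actual implementation) of median-of-ratios size factors computed on a full gene set and on a DE-only subset. The function name `size_factors` and the toy counts are made up for illustration: sample B is assumed to be sequenced at twice the depth of sample A, and three genes are assumed to be truly up-regulated in B.

```python
import math
from statistics import median

def size_factors(counts):
    """Simplified median-of-ratios size factors.

    counts: list of genes, each a list of raw counts per sample.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    # Work in log space; the row mean of the logs is the log geometric mean.
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    # Per sample: median over genes of log(count / geometric mean).
    return [
        math.exp(median(row[j] - g for row, g in zip(log_rows, log_geo_means)))
        for j in range(n_samples)
    ]

# Hypothetical toy data, columns = [sample A, sample B].
unchanged = [[10, 20], [50, 100], [100, 200], [30, 60], [80, 160]]  # depth effect only (2x)
de_genes = [[10, 80], [20, 160], [5, 40]]  # depth effect plus a real 4x change (8x total)

full = size_factors(unchanged + de_genes)
subset = size_factors(de_genes)

# Full gene set: B/A ratio is about 2, i.e. the sequencing-depth difference.
print(round(full[1] / full[0], 3))
# DE-only subset: B/A ratio is about 8, so normalizing with these factors
# would scale away the real expression difference.
print(round(subset[1] / subset[0], 3))
```

In this toy case the full gene set recovers the depth difference, while the pre-selected subset folds the biological signal into the size factors. Real data will be less clear-cut, but the direction of the problem is the same.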
Pre-selecting genes for one differential expression test based on the results of another is generally going to be hard to justify if the design is nested (i.e. samples overlap between test set #1 and test set #2). If these are two different datasets, then perhaps it can be justified more easily.
From a biological point of view, it is quite plausible that genes associated with recurrence are not differentially expressed between tumor and normal tissue, so including only those "first" differentially expressed genes in a second comparison may well lead to false negatives.