There are now tons of different packages available for statistical analysis of gene expression data to assess differential expression. Many of them, like mainstays such as edgeR and DESeq, carefully estimate variance parameters for individual genes/contigs (through various fancy means), then fit linear models and ultimately rank your genes/contigs by p-value (or similar). The subset of contigs you believe to be differentially expressed is the one that survives some FDR control procedure.
Another way to look at it is as a classification problem: given gene expression measurements from two conditions, you could perform suitably regularized logistic regression - say with a LASSO/sparsity penalty on the coefficients - and then pull out the subset of genes/contigs with nonzero coefficients after tuning the sparsity penalty through some procedure.
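To make that concrete, here is a minimal sketch of the second approach with scikit-learn. The data are synthetic and the penalty strength `C` is an arbitrary fixed choice for illustration - in practice you would tune it, e.g. by cross-validation:

```python
# Toy sketch: select "genes" via L1-penalized logistic regression.
# All data here are synthetic; C is fixed only for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 200
X = rng.normal(size=(n_samples, n_genes))   # expression matrix
y = np.repeat([0, 1], n_samples // 2)       # two conditions
X[y == 1, :5] += 2.0                        # 5 truly "DE" genes

# C is the inverse of the sparsity penalty; smaller C = sparser model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])     # genes with nonzero weight
print(len(selected), "genes selected:", selected)
```

With a strong enough penalty, most of the ~200 noise genes drop out and the truly shifted genes tend to survive - which is exactly the "pull out the nonzero coefficients" step described above.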
What is the reason to prefer one over the other? Are the two methods really conceptually equivalent but stem from different scientific communities?
Maybe I'm misunderstanding, but the second approach is not well-defined. If you have two conditions (even with replicates) and thousands of candidate predictors, there will be many correlated features. In the specific case of the LASSO, which particular correlated feature (gene) it pulls out and gives a nonzero weight is not consistent. In the larger classification case, you can imagine that there are many genes or sets of genes that almost fully indicate which condition a sample comes from, and deciding between them will not yield stable results in the presence of noise. In general, the coefficient values you get after such a classification problem don't satisfy the goals you would want (maybe a well-calibrated type I error rate for each gene decision under some hypothesis testing framework, etc.). There is recent work on addressing this (e.g. "A significance test for the lasso"), but it's not completely settled yet. All the magic will be in the "optimizing the sparsity penalty through some procedure" part, as you say, but it's not clear at all (at least to me) what that would look like.
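The instability point is easy to see in a toy simulation (everything below is made up): two near-duplicate predictors carry the same signal, and across repeated simulated "experiments" the lasso arbitrarily switches which twin gets the nonzero weight:

```python
# Toy illustration of lasso instability under correlated features.
# Synthetic data; alpha and sizes are arbitrary illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso

def fit_once(seed):
    # One simulated "experiment": x1 and x2 are near-duplicates of the
    # same latent signal z, plus 20 pure-noise predictors.
    rng = np.random.default_rng(seed)
    n = 100
    z = rng.normal(size=n)
    x1 = z + 0.05 * rng.normal(size=n)
    x2 = z + 0.05 * rng.normal(size=n)
    X = np.column_stack([x1, x2, rng.normal(size=(n, 20))])
    y = z + 0.5 * rng.normal(size=n)
    coef = Lasso(alpha=0.1).fit(X, y).coef_
    return abs(coef[0]) > 1e-8, abs(coef[1]) > 1e-8

picks = [fit_once(seed) for seed in range(50)]
print(sum(a for a, b in picks), "runs gave x1 a nonzero weight,",
      sum(b for a, b in picks), "runs gave x2 one")
```

Every run selects at least one of the twins (the signal is strong), but which twin it is flips between runs - so a gene list read off the nonzero coefficients is not stable.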
So I guess to summarize, my opinion is that the methods are not conceptually equivalent.
Great answer, thank you.
As I understand you, there are two points: 1) the LASSO estimator is not consistent, and 2) the genes/features are not i.i.d., so even if it were consistent it would not yield a biologically meaningful answer.
In regards to 1), it was shown in 2008 that bootstrapping leads to consistency in the LASSO estimator: http://www.di.ens.fr/~fbach/fbach_bolasso_icml2008.pdf - though in the gene expression case, as you rightly point out, this doesn't mean a whole lot.
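For what it's worth, the core Bolasso idea from that paper fits in a few lines: run the lasso on bootstrap resamples and keep only the features selected every single time. This is just a rough sketch on toy data (the alpha and resample count are arbitrary):

```python
# Bolasso-style sketch: intersect lasso supports across bootstrap
# resamples. Synthetic data; parameters are illustrative only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 120, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]              # 3 truly relevant features
y = X @ beta + rng.normal(size=n)

support = set(range(p))                   # start with all features
for _ in range(32):
    idx = rng.integers(0, n, size=n)      # bootstrap resample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    support &= set(np.flatnonzero(np.abs(coef) > 1e-8))

print("features selected in every resample:", sorted(support))
```

Noise features rarely survive all of the intersections, while the truly relevant ones do - which is the intuition behind the consistency result.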
I am curious though: there are tons of papers on selecting e.g. cancer biomarkers from gene expression data via classification methods. Are they all more or less invalid/misguided precisely because they don't account for the correlation structure of the genes?
Also, the violation of the i.i.d. assumption seems relevant for the hypothesis testing methods as well, no? If you're testing all genes independently, even if you shrink their variance estimates towards each other in some fancy way, you will never have a perfectly well-behaved test. But maybe I'm wrong?
I don't think the biomarker papers, as a class, are misguided... I think they just focus on a different problem. The question of predicting some sample label is well-posed, and you can tune and assess your accuracy through many different methods (cross validation, held-out sets, replication, etc.). I just think that attacking that problem through say a discriminative machine learning approach, then reverse engineering to get strong inferential statements about the particular features in the model, is not the best way to go. I suppose you can do it, like the bootstrapping paper you sent, or empirically determining (say) through permutation which regularization penalty corresponds to which FDR value. But overall this framework is not very natural, and is full of potential pitfalls.
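A rough sketch of that permutation idea (synthetic data; the penalty `C`, effect sizes, and permutation count are all arbitrary choices): refit under label permutations to estimate how many features a given penalty selects by chance, which gives a crude empirical FDR for that penalty:

```python
# Toy sketch: calibrate a sparsity penalty by permutation. For a fixed C,
# count features selected on real labels vs. permuted (null) labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 60, 100
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)
X[y == 1, :4] += 1.5                       # 4 truly informative features

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.2)
n_real = np.count_nonzero(model.fit(X, y).coef_)

# Under permuted labels, every selected feature is a false positive.
null_counts = [np.count_nonzero(model.fit(X, rng.permutation(y)).coef_)
               for _ in range(20)]
fdr_hat = np.mean(null_counts) / max(n_real, 1)
print(n_real, "selected; rough empirical FDR:", round(fdr_hat, 2))
```

This is workable, but as noted above it's bolted on after the fact rather than built into the method, which is part of why the framework feels unnatural.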
A lot of the biomarker papers don't try to make strong statements about which genes are (or aren't) in their predictive model, but it's fine because their overall goal of sample prediction is rigorously evaluated. Here I'm thinking of the early AML/ALL tumor classification paper as an example, where they chose 50 genes to use as predictors and acknowledged that "[t]he choice to use 50 informative genes in the predictor was somewhat arbitrary."
For your final point, I don't have a good answer. I agree that the complicated variance shrinkage methods (across genes) introduce subtleties that are hard to model and whose performance is hard to guarantee. I suppose here we may just have to fall back on the empirical checks that have been performed (or that we can perform ourselves) showing that the final statistics are well calibrated.
does the second approach account for the possibility that no genes are differentially expressed?
Good question: intuitively, if the conditions have no separating hyperplane then the resulting classifier will have the same performance as random guessing. You can check this by plotting the Receiver Operating Characteristic (ROC) curve and computing the Area Under the Curve (AUC); for random guessing, the ROC curve follows the diagonal and the AUC is about 0.5.
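A quick numerical illustration of that point (synthetic labels and scores): scores unrelated to the labels give an AUC near 0.5, while informative scores push it well above:

```python
# AUC of random vs. informative scores on synthetic binary labels.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=5000)                   # binary labels
random_scores = rng.random(5000)                    # unrelated to y
informative = y + rng.normal(scale=0.8, size=5000)  # correlated with y

auc_rand = roc_auc_score(y, random_scores)          # ~0.5
auc_info = roc_auc_score(y, informative)            # well above 0.5
print(round(auc_rand, 2), round(auc_info, 2))
```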
I'd guess that the latter method would miss things that the former wouldn't, since they're sort of targeted toward different goals. It'd be interesting to see a comparison, though.