When lasso regression is applied to transcriptomic data, the number of independent variables is reduced, in part because collinear features are dropped. However, the method is blind to the molecular topology of expression networks, and I think that, as a consequence of this dimension reduction, we lose statistical significance in the follow-up gene ontology pathway analysis.
For example, specific RNA isoforms or downstream transcripts, which may be regulated by multiple upstream factors such as transcription factors or splicing factors, may be retained in a lasso model because they change strongly with the dependent variable (age in this case), while their upstream drivers are removed due to collinearity. We would then lose important information about the drivers of, for example, cytokines, signalling, metabolic pathways or inflammatory networks.
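To make this scenario concrete, here is a minimal toy sketch of what I have in mind. It is built on my own assumptions and is not real data: one latent age-related signal, a "downstream" transcript that tracks it closely, and two "upstream" regulators that track it more loosely. Orthogonal sine waves stand in for independent noise so that the result is exactly reproducible. All names (`downstream`, `upstream_a`, `upstream_b`, `lasso_cd`) are hypothetical, and the solver is a plain cyclic coordinate descent written out by hand rather than any particular package's implementation.

```python
import math

def standardize(col):
    """Centre a column and scale it to unit variance (1/n convention)."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]

def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(columns, y, lam, sweeps=500):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes every column is standardized and y is centred.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)  # residual y - Xb, starting from b = 0
    for _ in range(sweeps):
        for j, x in enumerate(columns):
            # Covariance of feature j with the partial residual (feature j added back).
            rho = sum(xi * (ri + xi * beta[j]) for xi, ri in zip(x, resid)) / n
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [ri - delta * xi for xi, ri in zip(x, resid)]
                beta[j] = new
    return beta

# Toy data (assumption): a shared latent signal plus feature-specific "noise".
n = 200
t = [2 * math.pi * i / n for i in range(n)]
latent = [math.sin(v) for v in t]
age = [z + 0.3 * math.sin(5 * v) for z, v in zip(latent, t)]
downstream = [z + 0.2 * math.sin(2 * v) for z, v in zip(latent, t)]  # tight proxy
upstream_a = [z + 0.6 * math.sin(3 * v) for z, v in zip(latent, t)]  # looser proxy
upstream_b = [z + 0.6 * math.sin(4 * v) for z, v in zip(latent, t)]  # looser proxy

names = ["downstream", "upstream_a", "upstream_b"]
X = [standardize(c) for c in (downstream, upstream_a, upstream_b)]
age_mean = sum(age) / n
y = [a - age_mean for a in age]

strong = lasso_cd(X, y, lam=0.25)  # heavier penalty
weak = lasso_cd(X, y, lam=0.01)    # lighter penalty

selected_strong = [nm for nm, b in zip(names, strong) if b != 0.0]
selected_weak = [nm for nm, b in zip(names, weak) if b != 0.0]
print("strong penalty keeps:", selected_strong)
print("weak penalty keeps:", selected_weak)
```

In this construction the heavier penalty keeps only the downstream transcript, while the lighter penalty retains all three features. That is the behaviour I am worried about: the correlated upstream regulators vanish from the selected list even though they belong to the same network.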
In addition, other RNA isoform species from the same co-expression network will be dropped from the output because they are collinear with the retained features. As such, the model produces a list of features that are as strongly linked to the dependent variable as possible, but not to each other. When gene ontology pathway analysis is then performed on this list, are the enrichment outputs muted? Would this ultimately hide potential therapeutic molecular targets for intervention? Please see my crude graphical representation for clarity of my assumptions. Is this correct, or do I misunderstand how lasso penalizes collinearity?