I have a question about external validation in bioinformatics analyses. Many cancer research articles investigate the role of a specific gene set in a particular cancer type, for example the impact of ferroptosis-related genes in kidney renal clear cell carcinoma (KIRC), often using datasets such as TCGA-KIRC. The typical workflow starts with weighted gene co-expression network analysis (WGCNA) or differentially expressed gene (DEG) analysis, followed by univariable survival analysis and sometimes multivariable survival analysis or LASSO. A risk score is then constructed from the selected genes, typically with this formula:
Risk score for each patient = (coefficient of gene 1 * expression of gene 1) + (coefficient of gene 2 * expression of gene 2) + (coefficient of gene 3 * expression of gene 3) + ...
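To make the formula concrete, here is a minimal Python sketch of how I understand the score to be computed for each patient. The gene names, coefficients, and expression values are made up purely for illustration; in a real analysis the coefficients would come from the fitted Cox or LASSO model.

```python
# Hypothetical coefficients, standing in for those estimated by a
# Cox/LASSO model on the training cohort (values are invented).
coefficients = {"GENE1": 0.5, "GENE2": -0.25, "GENE3": 1.0}


def risk_score(expression, coefs):
    """Return the sum of coefficient * expression over the signature genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


# Hypothetical expression values for two patients (invented numbers).
patients = {
    "patient_A": {"GENE1": 2.0, "GENE2": 4.0, "GENE3": 1.0},
    "patient_B": {"GENE1": 1.0, "GENE2": 2.0, "GENE3": 0.5},
}

# One risk score per patient, using the same coefficients for everyone.
scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
print(scores)
```

Each patient's score is a weighted sum of their expression values, so the only inputs are the coefficient table and the expression matrix.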
My question concerns the validation step. When the performance of this risk score is confirmed in external datasets from sources such as GEO or ICGC, many articles are unclear about how the score is built there. Specifically, are the same coefficients derived from the training dataset used to compute the risk score in the validation dataset, or are all the survival analysis steps re-run in the validation dataset? I have read many relevant articles carefully, but they do not provide details on this point; only one paper mentions using the same coefficients for risk score validation. I would greatly appreciate it if you could share your experience or knowledge on this matter.
Thank you in advance.