I have read a few pan-cancer analysis papers, high-profile ones from CNS (Cell, Nature, Science) journals, and I am confused about how they handle the data.
Normally the data come from TCGA or similar databases. These data were obviously collected without any scientific hypothesis beforehand; they are simply the output of large batches of sequencing and array experiments (of course with careful sample selection, QC, normalization, etc.). What these papers typically do is first look for statistical differences across all samples, cancer types, genes, etc. Then they 'zoom in' and compare different subsets of samples, cancer types, genes, or other features of interest, in order to find subtler statistical differences and more interesting phenomena. Finally, they build a story around whatever came out.
My question is: doesn't this violate the assumptions of statistical tests? Aren't these comparisons multiple comparisons? Should we really be analyzing data after we have seen them, without any scientific hypothesis formulated in advance?
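
To make the multiple-comparisons worry concrete, here is a minimal simulation sketch. It is my own illustration, not something taken from any of those papers. It assumes a hypothetical screen of 2000 "genes" measured in 20 "tumor" and 20 "normal" samples, where every value is pure noise drawn from the same standard normal distribution, so any "significant" gene is a false positive by construction. The function names, sample sizes, and the choice of a simple z-test with known unit variance are all assumptions made only for this example.

```python
import math
import random


def two_sided_z_pvalue(a, b):
    """Two-sample z-test p-value, assuming both groups have known unit variance."""
    n_a, n_b = len(a), len(b)
    z = (sum(a) / n_a - sum(b) / n_b) / math.sqrt(1 / n_a + 1 / n_b)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_screen(n_genes=2000, n_per_group=20, alpha=0.05, seed=0):
    """Test every gene for a tumor/normal difference when no real difference exists."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(n_genes):
        # Hypothetical data: both groups come from the same N(0, 1) distribution.
        tumor = [rng.gauss(0, 1) for _ in range(n_per_group)]
        normal = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pvalues.append(two_sided_z_pvalue(tumor, normal))

    # Uncorrected: count every gene with p < alpha.
    raw_hits = sum(p < alpha for p in pvalues)
    # Bonferroni correction: require p < alpha / number of tests.
    bonferroni_hits = sum(p < alpha / n_genes for p in pvalues)
    return raw_hits, bonferroni_hits


raw_hits, bonferroni_hits = simulate_null_screen()
print("uncorrected hits:", raw_hits)
print("Bonferroni hits:", bonferroni_hits)
```

With no real signal at all, the uncorrected count should land near alpha times the number of genes (about 100 here), while the Bonferroni-corrected count should be at or near zero. What this sketch does not capture is the 'zoom-in' step: if the subsets to compare are chosen after looking at the data, the effective number of tests is larger than the number actually reported, and I do not see how a standard correction accounts for that.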