I have recently discovered a potential biomarker and would like to validate its prognostic value in the TCGA dataset on late-stage melanoma. I realize that one can build survival curves from the days_to_last_followup and days_to_death fields, but the problem is that those survival data do not always match the associated sequencing data. For instance, a stage I melanoma patient can have a submitted_tumor_site of "Regional Lymph Node", which is incompatible with stage I. In other words, the staging was done at the time of the original (earliest) diagnosis, while the submitted sample came from a relapsing tumor at a later date (and most likely at a higher stage). If I were to apply my biomarker to this set, the above-mentioned sample would in my opinion be mis-staged, since the sequenced tumor has stage III/IV characteristics while being labeled as stage I.
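
To make the starting point concrete, here is a minimal sketch of how I would derive a survival time and event indicator from those two fields. It assumes each patient is a plain dict keyed by the TCGA field names, and that missing values appear as empty strings or placeholders such as "[Not Available]" (an assumption about the file format). The helper names `parse_days` and `overall_survival` are hypothetical.

```python
from typing import Optional, Tuple


def parse_days(value) -> Optional[int]:
    """Return an integer day count, or None for missing/placeholder values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # Assumption: anything non-numeric (empty, "[Not Available]") is missing.
        return None


def overall_survival(record: dict) -> Optional[Tuple[int, int]]:
    """Return (time_in_days, event), counted from the initial diagnosis.

    event is 1 if a death was recorded, 0 if the patient is censored
    at the last follow-up. Returns None when neither field is usable.
    """
    death = parse_days(record.get("days_to_death"))
    if death is not None:
        return death, 1
    followup = parse_days(record.get("days_to_last_followup"))
    if followup is not None:
        return followup, 0
    return None


# Made-up example records, for illustration only.
deceased = {"days_to_death": "400", "days_to_last_followup": "[Not Available]"}
censored = {"days_to_death": "[Not Available]", "days_to_last_followup": "900"}
print(overall_survival(deceased))
print(overall_survival(censored))
```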
An alternative approach would be to select samples based on the site of the submitted biopsy (for instance, including only tumors that have spread to the regional lymph nodes), taking into account that the biopsy was taken a number of days after the earliest diagnosis (days_to_submitted_specimen_dx would provide me with that number). The problem with this is that (again) the staging should be taken into account, as staging is obviously a major determinant of outcome. Therefore, my question is whether the stage at the time the submitted biopsy was taken is available, and if so, where I can find it (I have checked https://tcga-data.nci.nih.gov/docs/dictionary/ but did not find it there). If not, could anyone suggest a fair alternative for coupling sequencing data to survival?
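
The alternative I have in mind could be sketched roughly as follows: keep only samples from a given biopsy site and count survival from the day the specimen was taken rather than from the initial diagnosis. This is only an illustration under assumptions. The helper names are hypothetical, the exact site string and the missing-value placeholders should be checked against the actual clinical file, and it does not recover the stage at the time of biopsy.

```python
from typing import Optional, Tuple


def to_days(value) -> Optional[int]:
    """Return an integer day count, or None for missing/placeholder values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def survival_from_specimen(
    record: dict, site: str = "Regional Lymph Node"
) -> Optional[Tuple[int, int]]:
    """Return (days_since_specimen, event) for samples taken at `site`.

    Returns None if the sample is from another site, if a required field
    is missing, or if the outcome date precedes the specimen date.
    """
    if record.get("submitted_tumor_site") != site:
        return None
    offset = to_days(record.get("days_to_submitted_specimen_dx"))
    if offset is None:
        return None
    death = to_days(record.get("days_to_death"))
    if death is not None:
        time_from_dx, event = death, 1
    else:
        followup = to_days(record.get("days_to_last_followup"))
        if followup is None:
            return None
        time_from_dx, event = followup, 0
    # Shift the time origin from initial diagnosis to the specimen date.
    shifted = time_from_dx - offset
    return (shifted, event) if shifted >= 0 else None


# Made-up example record, for illustration only.
example = {
    "submitted_tumor_site": "Regional Lymph Node",
    "days_to_submitted_specimen_dx": "300",
    "days_to_death": "1000",
    "days_to_last_followup": "[Not Available]",
}
print(survival_from_specimen(example))
```

Records with an outcome date before the specimen date are dropped here rather than corrected, which is a design choice of this sketch and not something the TCGA documentation prescribes.
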
P.S. Sorry for being verbose, but I found that survival and staging-related questions about the TCGA database are underrepresented, and other Biostarrers might benefit from a slightly longer version of this post.