I recently discovered a potential biomarker and would like to validate its prognostic value in the TCGA dataset on late-stage melanoma. I realized that one can build survival curves from the days_to_last_followup and days_to_death fields, but the problem is that those survival data do not line up with the associated sequencing data. For instance, a stage I melanoma patient can have a submitted_tumor_site of "Regional Lymph Node", which is incompatible with stage I. In other words, the staging was done at the time of the original (earliest) diagnosis, while the submitted sample came from a relapsed tumor at a later date (most likely at a higher stage). If I applied my biomarker to this set, that sample would in my opinion be mis-staged: the sequenced tumor has stage III/IV characteristics but is recorded as stage I.
An alternative approach would be to select samples based on the site of the submitted biopsy (for instance including only tumors that have spread into the regional lymph node), taking into account the fact that the biopsy was taken a number of days after the earliest diagnosis (the days_to_submitted_specimen_dx would provide me with that number). The problem with this is that (again) the staging should be taken into account, as staging obviously is a major determinant of outcome. Therefore, my question is whether the stage at the time the submitted biopsy was taken is available, and if so where I can find that (I have checked https://tcga-data.nci.nih.gov/docs/dictionary/ but did not find it there). If not, could anyone suggest to me what would be a fair alternative for coupling sequencing data to survival?
P.S. Sorry for being verbose, but I found that survival and staging-related questions about the TCGA database are underrepresented, and other Biostarrers might benefit from a slightly longer version of this post.
I have submitted my question to the TCGA, and I am pasting their entire answer below:
We may not have the exact time interval with corresponding staging as
you request, but below is an explanation of each clinical variable. I
hope that you can use this information for your analysis:
The only overall stage that TCGA collected for SKCM is the
"ajcc_pathologic_tumor_stage" in the clinical_patient_skcm.txt file.
As you indicated, this reflects the stage at initial pathologic
diagnosis and this diagnosis is not necessarily the event that yielded
the biospecimen sent to the BCR. Unfortunately, TCGA did not collect
the stage specifically at the time that the specimen sent to the BCR
was obtained.
The "days_to_initial_pathologic_diagnosis" indicates the date of
initial melanoma diagnosis. The "submitted_tumor_dx_days_to" indicates
the date of diagnosis for the sample submitted to the BCR (actually
days from the initial melanoma diagnosis).
There is also a "days_to_sample_procurement" in the
nationwidechildrens.org_ssf_tumor_samples_skcm.txt file. This
indicates the days to cancer sample procurement for the sample
submitted to the BCR for TCGA in relation to the date of initial
pathologic diagnosis.
If you filter "days_to_sample_procurement" for 0 (or within a number
of days) and use primary tumor (submitted_tumor_site) samples, the
"ajcc_pathologic_tumor_stage" should reflect the stage at the time the
submitted biopsy was taken.
Indeed, as suggested by the TCGA, days_to_sample_procurement is the more accurate field for defining the date on which the tumor sample was obtained (rather than the days_to_submitted_specimen_dx I mentioned in my original post).
Without wanting to dive into the pathology reports (yet), I see a number of possibilities:
Filter based on the site of the biopsy. For instance, if submitted_tumor_site is "Distant Metastasis", the sample is by definition from a stage IV tumor; if it is "Regional Lymph Node", it should be stage III or stage IV. In this case, the number of days to use for the survival curves is last_contact_days_to - days_to_sample_procurement (censored) or death_days_to - days_to_sample_procurement (not censored).
Filter for days_to_sample_procurement around 0 days. As suggested in the reply by the TCGA team, ajcc_pathologic_tumor_stage should then reflect the stage at the time the biopsy was taken. The formulas above for calculating the days for the survival curves can be used in this case too.
Ignore the mis-staging of samples (not my favourite option!).
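The first two options can be sketched in a few lines of Python. This is only a rough illustration, not a tested pipeline: it assumes the clinical and sample tables have already been merged into one dictionary per patient, that bcr_patient_barcode is the patient identifier, that missing values appear as non-numeric placeholders such as "[Not Available]", and that primary samples carry the site value "Primary Tumor". The function names and the example records are made up.

```python
from typing import Optional


def parse_days(value) -> Optional[int]:
    """Return an int, or None for non-numeric placeholders (assumed format)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def survival_from_procurement(record):
    """Return (days, event) counted from sample procurement, or None.

    event is 1 for death (not censored) and 0 for last contact (censored).
    """
    start = parse_days(record.get("days_to_sample_procurement"))
    death = parse_days(record.get("death_days_to"))
    last = parse_days(record.get("last_contact_days_to"))
    if start is None:
        return None
    if death is not None:
        end, event = death, 1
    elif last is not None:
        end, event = last, 0
    else:
        return None
    days = end - start
    if days < 0:
        return None  # follow-up ends before procurement: inconsistent record
    return days, event


def select(records, sites=None, max_procurement_days=None):
    """Filter by biopsy site and/or procurement close to initial diagnosis."""
    rows = []
    for r in records:
        if sites is not None and r.get("submitted_tumor_site") not in sites:
            continue
        start = parse_days(r.get("days_to_sample_procurement"))
        if max_procurement_days is not None and (
            start is None or abs(start) > max_procurement_days
        ):
            continue
        result = survival_from_procurement(r)
        if result is not None:
            rows.append((r["bcr_patient_barcode"], *result))
    return rows


# Made-up records for illustration only.
NA = "[Not Available]"
records = [
    {"bcr_patient_barcode": "P1", "submitted_tumor_site": "Regional Lymph Node",
     "days_to_sample_procurement": "400", "death_days_to": "1000",
     "last_contact_days_to": NA},
    {"bcr_patient_barcode": "P2", "submitted_tumor_site": "Primary Tumor",
     "days_to_sample_procurement": "0", "death_days_to": NA,
     "last_contact_days_to": "730"},
    {"bcr_patient_barcode": "P3", "submitted_tumor_site": "Distant Metastasis",
     "days_to_sample_procurement": NA, "death_days_to": "500",
     "last_contact_days_to": NA},
]

# Option 1: keep stage III/IV material based on the biopsy site.
by_site = select(records, sites={"Regional Lymph Node", "Distant Metastasis"})
# Option 2: keep primary tumors procured within 30 days of initial diagnosis.
near_dx = select(records, sites={"Primary Tumor"}, max_procurement_days=30)
print(by_site)
print(near_dx)
```

The resulting (days, event) pairs can then be passed to whatever survival package you prefer. Records without a usable days_to_sample_procurement are dropped rather than guessed at, and the 30-day window is an arbitrary choice for the example.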