I've been looking at the TCGA clinical data, queried using TCGAbiolinks:
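library(TCGAbiolinks)
# List all GDC projects and keep only the TCGA ones
allproj <- getGDCprojects()
projs <- allproj[startsWith(allproj$id, 'TCGA'), ]$id
# Fetch the clinical table for each TCGA project
clin <- lapply(projs, FUN = GDCquery_clinic, type = 'clinical')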
I was surprised to see some patients with death dates before TCGA data collection started in ~2006. There are 7,510 patients reported Alive and 3,641 reported Dead. Of the 3,641 Dead, 2,705 have a year_of_death listed, and 31% of those have a year_of_death earlier than 2005, some as early as 1990.
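For anyone who wants to check these counts, here is a rough sketch of how they could be tallied. It assumes each clinical data frame has `vital_status` and `year_of_death` columns; `summarise_deaths` and the `toy` input are hypothetical names for illustration, and the toy values are made up, not TCGA data.

```r
# Hypothetical helper: tally vital status and early death years across a
# list of per-project clinical data frames (assumes both columns exist).
summarise_deaths <- function(clin, cutoff = 2005) {
  # Keep only the two columns of interest so rbind works even when
  # projects return different sets of clinical columns.
  keep <- lapply(clin, function(d) {
    data.frame(
      vital_status  = tolower(as.character(d$vital_status)),
      year_of_death = suppressWarnings(as.integer(as.character(d$year_of_death))),
      stringsAsFactors = FALSE
    )
  })
  all_clin <- do.call(rbind, keep)

  dead      <- all_clin[!is.na(all_clin$vital_status) & all_clin$vital_status == "dead", ]
  with_year <- dead[!is.na(dead$year_of_death), ]

  list(
    n_alive            = sum(all_clin$vital_status == "alive", na.rm = TRUE),
    n_dead             = nrow(dead),
    n_dead_with_year   = nrow(with_year),
    # Fraction of dated deaths that fall before the cutoff year
    frac_before_cutoff = mean(with_year$year_of_death < cutoff),
    year_range         = range(with_year$year_of_death)
  )
}

# Toy input with made-up values, only to show the call
toy <- list(
  data.frame(vital_status = c("Alive", "Dead", "Dead"),
             year_of_death = c(NA, 1998, 2010)),
  data.frame(vital_status = c("Dead", "Alive"),
             year_of_death = c(NA, NA), extra = 1:2)
)
res <- summarise_deaths(toy)
```

On the real data the call would be `summarise_deaths(clin)`, with `clin` being the list returned by the `lapply` call above.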
Does anyone know what's going on here? Are these dates correct, and if so, were the samples not taken at diagnosis?
Also, the latest recorded death date is 2014. Were patients followed at all after TCGA finished in 2014, or are some of the patients listed as Alive actually dead now? What timeframe does the vital status refer to?