Hello everyone,
I’m working with the open-access TCGA Breast Cancer (TCGA-BRCA) data and have downloaded both the clinical data and the corresponding gene expression count files (.tsv). I’ve noticed that the clinical dataset contains multiple rows for the same case ID.
While many columns remain consistent across these duplicate entries, several important clinical fields differ, such as:
- diagnoses.age_at_diagnosis
- diagnoses.ajcc_pathologic_m
- diagnoses.ajcc_pathologic_n
- diagnoses.ajcc_pathologic_stage
- diagnoses.ajcc_pathologic_t
- diagnoses.diagnosis_is_primary_disease
- diagnoses.morphology
- diagnoses.laterality
- diagnoses.site_of_resection_or_biopsy
- diagnoses.prior_treatment
- and other treatment-related columns.
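To make the problem concrete, here is a minimal sketch of how the differing columns can be listed per case ID. It only surfaces the conflicts and does not decide which row to keep. The ID column name `cases.case_id`, the `demographic.gender` column, and the toy rows are assumptions for illustration, not values from the actual download.

```python
from collections import defaultdict


def conflicting_columns(rows, id_column="cases.case_id"):
    """Map each case ID that has several rows to the columns whose values differ.

    `rows` is an iterable of dicts (for example from csv.DictReader).
    `id_column` is an assumed column name; adjust it to match the real file.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_column]].append(row)

    conflicts = {}
    for case_id, case_rows in grouped.items():
        if len(case_rows) < 2:
            continue  # only one row for this case, nothing to compare
        conflicts[case_id] = sorted(
            column
            for column in case_rows[0]
            if column != id_column
            and len({r.get(column) for r in case_rows}) > 1
        )
    return conflicts


# Toy rows with made-up values, purely for illustration.
rows = [
    {"cases.case_id": "case-A", "demographic.gender": "female",
     "diagnoses.ajcc_pathologic_stage": "Stage IIA",
     "diagnoses.diagnosis_is_primary_disease": "true"},
    {"cases.case_id": "case-A", "demographic.gender": "female",
     "diagnoses.ajcc_pathologic_stage": "Stage IV",
     "diagnoses.diagnosis_is_primary_disease": "false"},
    {"cases.case_id": "case-B", "demographic.gender": "female",
     "diagnoses.ajcc_pathologic_stage": "Stage I",
     "diagnoses.diagnosis_is_primary_disease": "true"},
]

print(conflicting_columns(rows))
```

On a real file, the rows could come from `csv.DictReader(handle, delimiter="\t")`, where `handle` is the opened clinical TSV.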
I’m unsure how to decide which row to select when a case ID has multiple entries with varying information in these key columns. I’ve searched for documentation or guidelines on this, but haven’t found anything specific.
Has anyone encountered this issue before? How do you typically handle or resolve these conflicting entries for a single case ID in TCGA clinical data?
Any suggestions would be greatly appreciated. Thank you!