Question

Selecting clinical rows for repeated case IDs in TCGA-BRCA data

6 weeks ago

BhagyashreeWaghale • 0

Hello everyone,

I’m working with the open-access TCGA Breast Cancer (TCGA-BRCA) data and have downloaded both the clinical data and the corresponding gene expression count files (.tsv). I’ve noticed that the clinical dataset contains multiple rows for the same case ID.

While many columns remain consistent across these duplicate entries, several important clinical fields differ, such as:

diagnoses.age_at_diagnosis
diagnoses.ajcc_pathologic_m
diagnoses.ajcc_pathologic_n
diagnoses.ajcc_pathologic_stage
diagnoses.ajcc_pathologic_t
diagnoses.diagnosis_is_primary_disease
diagnoses.morphology
diagnoses.laterality
diagnoses.site_of_resection_or_biopsy
diagnoses.prior_treatment
and other treatment-related columns.

I’m unsure how to decide which row to select when a case ID has multiple entries with varying information in these key columns. I’ve searched for documentation or guidelines on this, but haven’t found anything specific.

Has anyone encountered this issue before? How do you typically handle or resolve these conflicting entries for a single case ID in TCGA clinical data?

Any suggestions would be greatly appreciated. Thank you!

Clinical-Data Breast-Cancer TCGA • 430 views

score 1 · Answer 1 · 2025-05-11

6 weeks ago

Zhenyu Zhang ★ 1.3k

I assume you downloaded the data from GDC. First, GDC has a very helpful help desk (support@nci-gdc.datacommons.io) that you can ask. Second, I assume you have simply encountered patients with multiple diagnoses. How to handle them is a judgment call that depends on your own criteria.
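To illustrate what one such criterion could look like, here is a minimal pandas sketch. It first lists the columns that actually differ within a case, then keeps one row per case by preferring rows flagged as the primary disease and, among those, the earliest age at diagnosis. This rule is only an example, not a GDC recommendation. The case ID column name `cases.submitter_id` and the string value "true" in `diagnoses.diagnosis_is_primary_disease` are assumptions, so check both against the header and contents of your own file.

```python
import pandas as pd

# Assumed column names; verify them against your own clinical file header.
CASE_COL = "cases.submitter_id"  # assumption: not named in this thread
PRIMARY_COL = "diagnoses.diagnosis_is_primary_disease"
AGE_COL = "diagnoses.age_at_diagnosis"


def varying_columns(clinical: pd.DataFrame, case_col: str = CASE_COL) -> list:
    """Return columns that take more than one value within at least one case."""
    n_unique = clinical.groupby(case_col).nunique(dropna=False)
    return n_unique.columns[(n_unique > 1).any()].tolist()


def pick_primary_row(clinical: pd.DataFrame, case_col: str = CASE_COL) -> pd.DataFrame:
    """Keep one row per case: primary-disease rows first, then earliest age."""
    # Assumption: the primary-disease flag is stored as the text "true"/"false".
    is_primary = clinical[PRIMARY_COL].astype(str).str.lower().eq("true")
    # Non-numeric placeholders for missing ages become NaN and sort last.
    age = pd.to_numeric(clinical[AGE_COL], errors="coerce")

    ranked = clinical.assign(_is_primary=is_primary, _age=age).sort_values(
        ["_is_primary", "_age"], ascending=[False, True], na_position="last"
    )
    chosen = ranked.drop_duplicates(subset=case_col, keep="first")
    # Drop the helper columns and restore the original row order.
    return chosen.drop(columns=["_is_primary", "_age"]).sort_index()
```

You could load the table with something like `pd.read_csv("clinical.tsv", sep="\t", dtype=str)` (the file name is hypothetical), call `varying_columns` to see which fields conflict, and then call `pick_primary_row` to get one row per case. If your analysis needs a different rule, for example the diagnosis that matches the sequenced sample, change the sort keys accordingly.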
