Selecting clinical rows for repeated case IDs in TCGA-BRCA data
6 weeks ago

Hello everyone,

I’m working with the open-access TCGA Breast Cancer (TCGA-BRCA) data and have downloaded both the clinical data and the corresponding gene expression count files (.tsv). I’ve noticed that the clinical dataset contains multiple rows for the same case ID.

While many columns remain consistent across these duplicate entries, several important clinical fields differ, such as:

  • diagnoses.age_at_diagnosis
  • diagnoses.ajcc_pathologic_m
  • diagnoses.ajcc_pathologic_n
  • diagnoses.ajcc_pathologic_stage
  • diagnoses.ajcc_pathologic_t
  • diagnoses.diagnosis_is_primary_disease
  • diagnoses.morphology
  • diagnoses.laterality
  • diagnoses.site_of_resection_or_biopsy
  • diagnoses.prior_treatment
  • and other treatment-related columns.
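
A quick way to confirm which fields actually conflict is to count, for each column, how many case IDs carry more than one distinct value. The following is only a minimal sketch: the case ID column name `cases.submitter_id`, the file name in the comment, and the helper names `load_rows` and `varying_columns` are assumptions for illustration, so check them against the header of your own clinical file.

```python
import csv
from collections import defaultdict

ID_COL = "cases.submitter_id"  # assumed name of the case ID column; check your file


def load_rows(path):
    """Read a tab-separated clinical file into a list of dicts (one per row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def varying_columns(rows, id_col=ID_COL):
    """Return {column: number of case IDs with more than one distinct value}."""
    # column -> case ID -> set of distinct values seen for that case
    values = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                values[col][row[id_col]].add(val)
    counts = {
        col: sum(len(distinct) > 1 for distinct in per_case.values())
        for col, per_case in values.items()
    }
    return {col: n for col, n in counts.items() if n > 0}


# Toy rows standing in for load_rows("clinical.tsv") (hypothetical file name)
toy_rows = [
    {ID_COL: "A", "diagnoses.ajcc_pathologic_stage": "Stage I", "demographic.gender": "female"},
    {ID_COL: "A", "diagnoses.ajcc_pathologic_stage": "Stage II", "demographic.gender": "female"},
    {ID_COL: "B", "diagnoses.ajcc_pathologic_stage": "Stage I", "demographic.gender": "female"},
]
conflicts = varying_columns(toy_rows)
print(conflicts)
```

Columns that do not appear in the result are identical across all rows of every case, so they can be ignored when deciding which row to keep.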

I’m unsure how to decide which row to select when a case ID has multiple entries with varying information in these key columns. I’ve searched for documentation or guidelines on this, but haven’t found anything specific.

Has anyone encountered this issue before? How do you typically handle or resolve these conflicting entries for a single case ID in TCGA clinical data?

Any suggestions would be greatly appreciated. Thank you!

Clinical-Data Breast-Cancer TCGA • 430 views
6 weeks ago
Zhenyu Zhang ★ 1.3k

I assume you downloaded the data from the GDC. First, the GDC has a very helpful help desk (support@nci-gdc.datacommons.io) that you can ask. Second, I assume you have simply encountered patients with multiple diagnoses. How to handle them is a judgment call that depends on your own criteria.
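
As an illustration of one possible criterion (not a GDC recommendation), the sketch below keeps the row flagged in `diagnoses.diagnosis_is_primary_disease` for each case and lists the cases it cannot resolve that way. It assumes the case ID column is named `cases.submitter_id` and that the flag is stored as the text `true`/`false`; both are assumptions to verify against your file, and the function name `select_primary_rows` is hypothetical.

```python
from collections import defaultdict

ID_COL = "cases.submitter_id"  # assumed name of the case ID column
PRIMARY_COL = "diagnoses.diagnosis_is_primary_disease"  # column named in the question


def select_primary_rows(rows, id_col=ID_COL, primary_col=PRIMARY_COL):
    """Pick one row per case ID.

    Returns (selected, unresolved): `selected` maps case ID to the chosen row;
    `unresolved` lists case IDs with several rows but zero or multiple rows
    flagged as primary disease, which need a manual decision.
    """
    by_case = defaultdict(list)
    for row in rows:
        by_case[row[id_col]].append(row)

    selected, unresolved = {}, []
    for case_id, case_rows in by_case.items():
        if len(case_rows) == 1:
            # Only one row for this case, so there is nothing to choose between.
            selected[case_id] = case_rows[0]
            continue
        # Assumption: the flag is stored as the text "true" or "false".
        primary = [
            r for r in case_rows
            if str(r.get(primary_col, "")).strip().lower() == "true"
        ]
        if len(primary) == 1:
            selected[case_id] = primary[0]
        else:
            unresolved.append(case_id)
    return selected, unresolved


# Toy rows standing in for the parsed clinical table
toy_rows = [
    {ID_COL: "A", PRIMARY_COL: "true", "diagnoses.ajcc_pathologic_stage": "Stage I"},
    {ID_COL: "A", PRIMARY_COL: "false", "diagnoses.ajcc_pathologic_stage": "Stage II"},
    {ID_COL: "B", PRIMARY_COL: "true", "diagnoses.ajcc_pathologic_stage": "Stage III"},
    {ID_COL: "C", PRIMARY_COL: "false", "diagnoses.ajcc_pathologic_stage": "Stage I"},
    {ID_COL: "C", PRIMARY_COL: "false", "diagnoses.ajcc_pathologic_stage": "Stage II"},
]
selected, unresolved = select_primary_rows(toy_rows)
print(sorted(selected), unresolved)
```

Cases that end up in `unresolved` are the ones where the flag alone does not settle the choice, so they still need a rule of your own or a question to the help desk.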
