Question: Clinical data discrepancies between cBioPortal and GDC data?
written 6 months ago by Vasu

I recently downloaded the TCGA colorectal clinical data from the GDC portal and got the following files:

nationwidechildrens.org_clinical_patient_coad.txt

nationwidechildrens.org_clinical_patient_read.txt

I combined both files, which gives data for a total of 628 patients. Among them I see:

563 - Alive
65 - Dead

For example

times   bcr_patient_barcode   patient.vital_status
49      TCGA-5M-AAT4          Dead
290     TCGA-5M-AAT6          Dead
154     TCGA-3L-AA1B          Alive
1200    TCGA-5M-AATE          Alive
648     TCGA-A6-2671          Alive

All 628 patients have a value for Days_to_Last_followup.

I then checked the cBioPortal TCGA Provisional colorectal clinical data. Here the counts for patient_vital_status are different:

502 - Alive
130 - Dead
 8 - NA

In this dataset, almost 60 patients have NA for Days_to_Last_followup. I'm interested in doing survival analysis, and I am now confused about which dataset to use.

For example

times   bcr_patient_barcode   patient.vital_status
NA      TCGA-5M-AAT4          Dead
NA      TCGA-5M-AAT6          Dead
154     TCGA-3L-AA1B          Alive
1200    TCGA-5M-AATE          Alive
648     TCGA-A6-2671          Dead

As the data above show, GDC and cBioPortal give different information for the same patients.
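A comparison like this can be made systematic instead of being read off by eye. The following is only a minimal sketch in Python, using the five example rows quoted in this post; the dictionary layout and the function name `find_discrepancies` are assumptions made for illustration, and `None` stands in for NA.

```python
# Example rows copied from the post: barcode -> (times, vital_status).
# None stands in for NA.
gdc = {
    "TCGA-5M-AAT4": (49, "Dead"),
    "TCGA-5M-AAT6": (290, "Dead"),
    "TCGA-3L-AA1B": (154, "Alive"),
    "TCGA-5M-AATE": (1200, "Alive"),
    "TCGA-A6-2671": (648, "Alive"),
}
cbio = {
    "TCGA-5M-AAT4": (None, "Dead"),
    "TCGA-5M-AAT6": (None, "Dead"),
    "TCGA-3L-AA1B": (154, "Alive"),
    "TCGA-5M-AATE": (1200, "Alive"),
    "TCGA-A6-2671": (648, "Dead"),
}

FIELDS = ("times", "vital_status")


def find_discrepancies(a, b):
    """Return {barcode: [names of fields that differ]} for shared barcodes."""
    result = {}
    for barcode in sorted(a.keys() & b.keys()):
        diffs = [f for f, x, y in zip(FIELDS, a[barcode], b[barcode]) if x != y]
        if diffs:
            result[barcode] = diffs
    return result


discrepancies = find_discrepancies(gdc, cbio)
for barcode, diffs in discrepancies.items():
    print(barcode, diffs)
```

Run on the full tables, a listing like this separates patients whose follow-up time is missing in one source from patients whose vital status actually disagrees.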

It looks like the cBioPortal clinical data is the more recent one, as it shows more patients as Dead. But why do some patients in the cBioPortal clinical data have no Days_to_Last_followup? Which of the two is the right one for the analysis?

Thank you.

written 6 months ago by Vasu

The GDC should be the most up to date, as it is the primary source of TCGA data. cBioPortal is a third party (developed at MSKCC) that is not part of the NIH. The issue may be that the clinical data reference different samples / aliquots. cBioPortal may also have imputed missing values that they encountered in the original data that they pulled from the GDC.

I would always go by the data at the GDC because it is the primary source. Discrepancies between the GDC and third-party websites are a common finding. You will be fine as long as you quote the exact source and version of your data. If no version is available, then date-stamp it in your methods.
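One way to date-stamp a download is to store a small provenance record next to the file. This is only a sketch under stated assumptions: `provenance_record` is a hypothetical helper, the SHA-256 checksum is an addition not mentioned above, and the demo file is a stand-in, not real TCGA data.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def provenance_record(path, source, retrieved=None):
    """Describe a downloaded file: name, source, retrieval date, checksum."""
    path = Path(path)
    return {
        "file": path.name,
        "source": source,
        # Defaults to today's date if no retrieval date is given.
        "retrieved": (retrieved or date.today()).isoformat(),
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }


# Demo on a stand-in file; in practice, pass the path of the downloaded file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "clinical_patient_demo.txt"
    demo.write_bytes(b"bcr_patient_barcode\tvital_status\n")
    record = provenance_record(demo, source="GDC portal")
    print(record)
```

The checksum lets you confirm later that the file you analysed is byte-for-byte the one you downloaded on that date.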

Obviously patients cannot come back to life, so there are logical reasons behind the discrepancies that you observe.

written 6 months ago by Kevin Blighe

If you say GDC is more up to date than cBioPortal: I see 65 Dead in GDC and 130 Dead in cBioPortal. That is not a small difference.

Should I ask the GDC community about this?

written 6 months ago by Vasu

They could simply be referencing different patients from the same cancer - I am not sure. I have also heard that the GDC clinical data contains errors. It would be interesting to also see how the patient numbers appear on the GDC Legacy Archive. I would contact both cBioPortal (MSKCC) and GDC.

As the analyst, in certain situations, the best we can do is just date-stamp and version control the data that's given to us, i.e., in order to protect our own butts.

written 6 months ago by Kevin Blighe

I second your suggestion (y)

Yes, there may be different patients in cBioPortal and GDC, but in my question there is one patient, TCGA-A6-2671, who is Alive in GDC and Dead in cBioPortal.

written 6 months ago by Vasu

The information / paper trail for the patient may be difficult to find. Another option: just set to NA all values that disagree between the GDC and cBioPortal, although you then reduce your sample size (n).
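The NA option can be sketched as follows. This is a minimal illustration in Python on the five example rows from the question; the function name `reconcile` and the tuple layout are assumptions, and `None` stands in for NA.

```python
# Example rows from the question: barcode -> (times, vital_status).
gdc = {
    "TCGA-5M-AAT4": (49, "Dead"),
    "TCGA-5M-AAT6": (290, "Dead"),
    "TCGA-3L-AA1B": (154, "Alive"),
    "TCGA-5M-AATE": (1200, "Alive"),
    "TCGA-A6-2671": (648, "Alive"),
}
cbio = {
    "TCGA-5M-AAT4": (None, "Dead"),
    "TCGA-5M-AAT6": (None, "Dead"),
    "TCGA-3L-AA1B": (154, "Alive"),
    "TCGA-5M-AATE": (1200, "Alive"),
    "TCGA-A6-2671": (648, "Dead"),
}


def reconcile(a, b):
    """Keep a value only where both sources agree; otherwise set it to None."""
    merged = {}
    for barcode in a.keys() & b.keys():
        merged[barcode] = tuple(
            x if x == y else None for x, y in zip(a[barcode], b[barcode])
        )
    return merged


merged = reconcile(gdc, cbio)

# Only rows with both a time and a status remain usable for survival analysis.
complete = {bc: row for bc, row in merged.items() if None not in row}
print(len(merged), "patients,", len(complete), "complete after reconciliation")
```

On these five rows only two patients stay complete, which shows how quickly this approach reduces n.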

written 6 months ago by Kevin Blighe

I see that the patients are the same in both portals.

From the same place in the GDC where I downloaded the patient clinical data for both colon and rectal, I also downloaded the following files:

nationwidechildrens.org_clinical_follow_up_v1.0_coad.txt

nationwidechildrens.org_clinical_follow_up_v1.0_read.txt

I see that the vital status in these files differs from the patient clinical data. What are these follow_up files?

written 6 months ago by Vasu

Those follow up files may be defined here: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization
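If, as the file names suggest, the follow_up files hold later records for the same patients, one possible way to combine them with the patient file is to keep the most recent record per patient. That interpretation is an assumption and should be checked against the GDC documentation. The sketch below uses made-up placeholder rows (PT-1, PT-2), not TCGA data, and the function name `latest_status` is hypothetical.

```python
def latest_status(patient_rows, follow_up_rows):
    """Return {barcode: (days, status)}, preferring the latest follow-up.

    Rows are (barcode, days_to_last_followup, vital_status) tuples.
    Follow-up rows without a day count are ignored.
    """
    latest = {bc: (days, status) for bc, days, status in patient_rows}
    for bc, days, status in follow_up_rows:
        if days is None:
            continue
        current = latest.get(bc)
        if current is None or current[0] is None or days > current[0]:
            latest[bc] = (days, status)
    return latest


# Placeholder rows for illustration only.
patient_rows = [("PT-1", 100, "Alive"), ("PT-2", 200, "Alive")]
follow_up_rows = [
    ("PT-1", 400, "Dead"),
    ("PT-1", 250, "Alive"),
    ("PT-2", None, "Alive"),
]

combined = latest_status(patient_rows, follow_up_rows)
print(combined)
```

This is deliberately simplified: it treats every record as a last-follow-up time and does not handle a separate days-to-death field.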

Getting the most out of the clinical data from the TCGA is indeed difficult, I admit. It has a high level of missingness.

written 6 months ago by Kevin Blighe