GDC TCGA: cannot find CNV files by their UUIDs
3.0 years ago
Ld_60 ▴ 40

Hi everyone,

I am trying to find the TCGA sample IDs that correspond to a set of DLBC tumour CNV files (downloaded in May 2018), using their file UUIDs. I am using the code provided here. The problem is that the function described in that link returns an empty table, i.e. the UUIDs of my files are not found in GDC.

I searched manually for the UUID of one of my files (da4b04a1-700b-4022-a56c-11329b8106cc) in the GDC repository (https://portal.gdc.cancer.gov/repository) and could not find it. The file appears to have been deleted, but I am not sure about that. I then filtered the repository to keep only the TCGA-DLBC Masked CNV files and looked for my file by its name (XYLEM_p_TCGASNP_207_212_N_GenomeWideSNP_6_A01_1051280.nocnv_grch38.seg.txt). I found a file with exactly the same name, except that it ends with .seg.v2.txt (v2 added between seg and txt). However, that file has a completely different UUID (193bdf2c-7e6c-44b7-9cd8-004504b3e7a1).

Could this mean that the v2 file is a newer version of the one I am looking for? If that is the case, how can I find the samples I have by their file UUIDs?

Thank you very much for your help!

TCGA GDC CNV UUID • 1.2k views

I posted another script here, which converts file names to TCGA barcodes: C: problem in matching the names between file names and patients Id in TCGA. However, it cannot find your file either.

Using the 'brute force' approach that I describe here (A: Sample names for TCGA data from GDC-legacy archive), I can see hg18 and hg19 data for this sample in GDC Legacy. In the main GDC, I only see the v2 file (hg38). One must assume that the sample was re-run for some reason and that v2 is the one to use.
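The file-name lookup can also be done programmatically. Below is a minimal Python sketch, assuming the public GDC files endpoint (https://api.gdc.cancer.gov/files) and the field names file_id, file_name and cases.samples.submitter_id. The helper names (to_v2_name, build_params, parse_hits, lookup) are hypothetical, and the '.seg.txt' to '.seg.v2.txt' renaming is only inferred from the single file discussed in this thread, so it may not hold for every file.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public GDC files endpoint.
GDC_FILES_URL = "https://api.gdc.cancer.gov/files"


def to_v2_name(file_name):
    """Guess the re-released name ('.seg.txt' -> '.seg.v2.txt').

    The renaming pattern is an assumption based on one observed file.
    """
    suffix = ".seg.txt"
    if file_name.endswith(suffix):
        return file_name[: -len(suffix)] + ".seg.v2.txt"
    return file_name


def build_params(file_names):
    """Build query parameters that select files by exact file name."""
    filters = {
        "op": "in",
        "content": {"field": "file_name", "value": list(file_names)},
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,cases.samples.submitter_id",
        "format": "JSON",
        "size": str(len(file_names)),
    }


def parse_hits(response):
    """Map file_name -> (file_id, [sample barcodes]) from a parsed response."""
    result = {}
    for hit in response.get("data", {}).get("hits", []):
        barcodes = [
            sample["submitter_id"]
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
        ]
        result[hit["file_name"]] = (hit["file_id"], barcodes)
    return result


def lookup(file_names):
    """Query GDC (network access required) and return the parsed mapping."""
    query = urllib.parse.urlencode(build_params(file_names))
    with urllib.request.urlopen(GDC_FILES_URL + "?" + query) as resp:
        return parse_hits(json.load(resp))
```

Calling lookup([to_v2_name(name) for name in old_names]) would then return, for each v2 file that GDC knows about, its current UUID and the sample barcodes attached to it. Files whose names do not follow the assumed pattern will simply be missing from the result.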

By the way, if you are doing copy number analysis, you may be interested in looking here: A: How to extract the list of genes from TCGA CNV data. That approach avoids the need to download any data from GDC and instead starts with the Broad Institute's pre-processed data.


Hi Kevin, thanks a lot for your answer. I have found out that the CNV data was updated on June 13th 2018 as part of Data Release 12.0. The corresponding release notes state: "Updated Copy Number Segment and Masked Copy Number Segment files are now available. These were generated using an improved mapping of hg38 coordinates for the Affymetrix SNP6.0 probe set". Since my data is from May 2018, this explains why my files are no longer available. I am still trying to figure out how I can search for data from previous releases.
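One thing that may be worth trying is the GDC API's history endpoint, which reports version and release information for a file UUID. The following is only a sketch under assumptions: the URL https://api.gdc.cancer.gov/history/<uuid> and the response keys (uuid, version, data_release) are taken to be as described in the GDC API documentation, the function names are hypothetical, and there is no guarantee that a pre-Release-12.0 UUID is linked to its v2 replacement.

```python
import json
import urllib.error
import urllib.request

# Assumption: GDC history endpoint, queried as <base>/<file UUID>.
GDC_HISTORY_URL = "https://api.gdc.cancer.gov/history/"


def fetch_history(uuid):
    """Return the list of version records for a file UUID (network required).

    An HTTP error (e.g. unknown UUID) is reported as an empty list.
    """
    try:
        with urllib.request.urlopen(GDC_HISTORY_URL + uuid) as resp:
            return json.load(resp)
    except urllib.error.HTTPError:
        return []


def latest_version(history):
    """Pick the record with the highest 'version', or None if there is none."""
    if not history:
        return None
    return max(history, key=lambda entry: int(entry.get("version", 0)))
```

If fetch_history returns records for an old UUID, latest_version gives the entry that should correspond to the currently released file. An empty list means the API does not expose any history for that UUID.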


Thanks for the follow-up, Ld_60. I am not sure that you will find the previous release in the public domain. You will likely have to contact the GDC. They have responded to me quickly in the past, provided that I used an academic affiliation.