Sample names for TCGA data from GDC-legacy archive
4.3 years ago
Vasu ▴ 710

Hi,

I needed raw RNA-seq data, so I downloaded the RNA-seq manifest file from the GDC Legacy Archive and used my access token to download the raw data.

The manifest looks like this:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live


After the download, I have folders named after the values in the "id" column. Inside each folder there is a tar.gz file.

For example:

d1017f74-3a39-4427-af57-273e34247b49
|___ UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz
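
The folder layout described above (one folder per manifest "id", holding the archive named in "filename") can be reproduced from the manifest itself. The following is a minimal sketch in base R, not part of the original post; it uses two rows copied from the manifest shown above, and the object names (`manifest_rows`, `expected_paths`, `uuid_by_archive`) are made up for illustration.

```r
# Two rows copied from the manifest above; the real file is tab-separated.
manifest_rows <- c(
  paste("id", "filename", "md5", "size", "state", sep = "\t"),
  paste("d1017f74-3a39-4427-af57-273e34247b49",
        "UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz",
        "ed7f23aa9540ef0242cb6ddde30d1aca", "5830465428", "live", sep = "\t"),
  paste("5e2d5c52-596f-49bc-967c-42129abbacbf",
        "UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz",
        "b1f03852b2ac3c3cd50cb4a87f2a116a", "7587398372", "live", sep = "\t")
)

# For a real manifest, pass the file path instead of text = ...
manifest <- read.delim(
  text = manifest_rows,
  colClasses = c("character", "character", "character", "numeric", "character")
)

# Relative path of each archive after download: <id>/<filename>
expected_paths <- file.path(manifest$id, manifest$filename)

# Lookup from archive name back to the file UUID (the folder name)
uuid_by_archive <- setNames(manifest$id, manifest$filename)

print(expected_paths)
```

This only links archives to file UUIDs; translating a file UUID into a TCGA barcode still needs a query against the GDC, as discussed in the answers.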


When I extracted the tar.gz files, I got FASTQ files like these:

110801_UNC12-SN629_0115_BD0DVEABXX.3_1.fastq
110801_UNC12-SN629_0115_BD0DVEABXX.3_2.fastq


What is the sample name here? It looks very confusing.

RNA-Seq tcga gdc • 12k views

As this is controlled data, you could log in at the main GDC ( https://portal.gdc.cancer.gov/ ) and use the search box to search for the file-names - they should be there. You would then obviously look for the UUID or TCGA barcode.

To do this programmatically, there are APIs but, the last time that I tried them, they were offline. There has been a lot of data being moved around relatively recently for the TCGA. One way that I did it was to download the JSON manifest for my data (from the Legacy Archive) and then use a loop in R to pull out the CASE ID (which is the UUID), in this case, which I then used to identify the patients. Here's the loop that I used (slow; sample filenames are in filenames object):

require(rjson)
manifest <- fromJSON(file="RNAseqManifest.json")

#Look up each filename's UUID from the manifest
fileUUIDs <- c()
for (i in seq_along(filenames))
{
  # Find the manifest record that contains this filename
  record <- manifest[[grep(filenames[i], manifest, fixed=TRUE, ignore.case=FALSE)]]

  # Sanity check: flag records whose file_name is not an exact match
  if (filenames[i]!=record$file_name)
  {
    print("FALSE")
  }

  fileUUIDs[i] <- record$cases[[1]]$case_id
}

Sorry, I didn't get what the filenames object is. What are the sample filenames?

Just a vector of your filenames, such as:

filenames <- c("UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz", ..., "UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz")

As shown in the manifest above, I already have the UUID. What I need is the TCGA sample name. Do I need to log in to the GDC and search for this?

I see. For UUID-to-TCGA barcode mapping, I was able to just use one of the clinical data files in BioTab format (also available at the Legacy Archive). For example, here is the file for breast cancer: https://portal.gdc.cancer.gov/legacy-archive/files/735bc5... The first 2 columns of that file are:

• bcr_patient_uuid
• bcr_patient_barcode

Somehow this code is not working anymore. However, the "UUIDtoBarcode" function from the TCGAutils R package provides the solution.

That is expected. The function that I wrote above is inefficient and served a specific purpose at that time. Did you try Vasu's (Sean Davis's) code below?

4.3 years ago
Vasu ▴ 710

Best way to do it.

library(GenomicDataCommons)
# header=TRUE so that the "id" column can be accessed by name
manifest <- read.table("gdc_manifest_rnaseq_fastq.txt", header=TRUE, stringsAsFactors=FALSE)

manifest:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

file_uuids <- manifest$id

d1017f74-3a39-4427-af57-273e34247b49
5e2d5c52-596f-49bc-967c-42129abbacbf
2ef74f93-5da2-454c-aca2-d86c289eacb8
e01ca3e0-beb0-46b7-bb7c-f5b16f966918
992a7083-28ce-4857-898e-9d4b4fbf2fa1
230082b7-39ec-4fe1-b3c6-daf35458f396
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2

library(GenomicDataCommons)
library(magrittr)

TCGAtranslateID = function(file_ids, legacy = TRUE) {
  info = files(legacy = legacy) %>%
    filter( ~ file_id %in% file_ids) %>%
    select('cases.samples.submitter_id') %>%
    results_all()
  # The mess of code below is to extract TCGA barcodes
  # id_list will contain a list (one item for each file_id)
  # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
  id_list = lapply(info$cases,function(a) {
    a[[1]][[1]][[1]]})
  # so we can later expand to a data.frame of the right size
  barcodes_per_file = sapply(id_list,length)
  # And build the data.frame
  return(data.frame(file_id = rep(ids(info),barcodes_per_file),
                    submitter_id = unlist(id_list)))
}

res = TCGAtranslateID(file_uuids)
head(res)

file_id    submitter_id
d1017f74-3a39-4427-af57-273e34247b49    TCGA-E9-A1NA-11A
5e2d5c52-596f-49bc-967c-42129abbacbf    TCGA-AO-A12H-01A
2ef74f93-5da2-454c-aca2-d86c289eacb8    TCGA-AC-A23E-01A

I found the solution on Sean Davis's blog: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/

For the record, Bioinfo's answer is Sean Davis's blog post, which can be seen here: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/

If you are using content that is not yours, please cite it. Also, we've added this functionality with Sean's permission to the TCGAutils package on Bioconductor.

Best regards, Marcel

Thank you for catching that, Marcel.

I just posted the answer showing how I solved the problem. Yes, I should have posted the link. I totally forgot about that while posting the solution. Sorry for that.

Yes, you most definitely should have posted the link, especially if the content was taken directly from the blog. You may not have intended to omit the attribution, but that omission makes you look bad. Gotta be extra careful, unfortunately.

Sure, I will be more careful next time. Thank you.

That's pretty cool - I've moved this to an answer. Please feel free to Accept it, as this will help others.

However, I received the following error in RStudio:

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"

It should be "filter", not "filter_".

Bioinfo, this exact function was working 2 weeks ago. Something has changed, either at the GDC server or in the version of the GenomicDataCommons package (or elsewhere). I tried it yesterday and it did not work either, unfortunately (tried from 2 different places).

Just now I gave the above-mentioned code a try. It worked for me.

I tried it just now and it now gives this:

Error in .gdc_post(entity_name(x), body = body, legacy = x$legacy, token = NULL,  :
In strptime(x, fmt, tz = "GMT") :
unknown timezone 'zone/tz/2018c.1.0/zoneinfo/Europe/London'

It could be due to the versions. Not sure. These are the versions I'm using.

GenomicDataCommons_1.5.3
BiocInstaller_1.31.1
R version 3.5.0

I modified the function to accept filenames, too: C: problem in matching the names between file names and patients Id in TCGA

Thanks.

Also got the function working again by updating directly from GitHub:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
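
Since the hand-written function has broken with package and server changes more than once in this thread, the `UUIDtoBarcode` function from TCGAutils mentioned in the comments may be the more maintainable route. The following is a rough sketch only: the wrapper name `translate_file_uuids` is made up, and whether file UUIDs from the Legacy Archive still resolve depends on the TCGAutils version and on the GDC service, so check `?UUIDtoBarcode` for the arguments available in your installation.

```r
# Hypothetical wrapper around TCGAutils::UUIDtoBarcode().
# It needs network access and the Bioconductor package TCGAutils.
translate_file_uuids <- function(file_uuids) {
  if (!requireNamespace("TCGAutils", quietly = TRUE)) {
    stop("TCGAutils is not installed; try BiocManager::install('TCGAutils')")
  }
  # from_type = "file_id" says that the input vector holds file UUIDs,
  # i.e. the values in the "id" column of the manifest
  TCGAutils::UUIDtoBarcode(file_uuids, from_type = "file_id")
}

# Not run here because it queries the GDC:
# res <- translate_file_uuids(manifest$id)
# head(res)
```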

Hi,

I tried your code but got this error:

Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: gdc-api.nci.nih.gov

Any suggestion or help is much appreciated.

Thanks

Try to install the development version of GenomicDataCommons:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")

3.3 years ago
david.peeney ▴ 30

copying my answer from Tutorial: TCGA UUIDS to TCGA barcode (SampleID) in R

Using the GenomicDataCommons method only gives you short barcodes (for identifying patients), which is not particularly useful when dealing with duplicate samples. A good method I found that gives me harmonized UUIDs and full aliquot barcodes is:

Firstly, you need to download the JSON manifest files from your selected study and file types from the GDC legacy archive (NOT GDC portal).

Then, use the following R script:

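library(dplyr)
library(jsonlite)
# Read the JSON manifest downloaded from the legacy archive
# ("legacy_manifest.json" is a placeholder; the original post does not name the file)
legacy = fromJSON("legacy_manifest.json")
legfnames = legacy[["file_id"]]
entities = legacy[["associated_entities"]]
# Stack the per-file entity tables into one data frame
IDconversion = bind_rows(entities, .id = "column_label")
# Assumes one associated entity per file, so the rows line up with legfnames
IDconversion['legacy file names'] = legfnames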
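
To see which level a given barcode identifies, it can help to split it into its fields. The sketch below is an illustration in base R rather than part of the original answer; the function name `parse_tcga_barcode` is made up, and it relies on the standard TCGA barcode layout (project, tissue source site, participant, then sample type and vial), in which sample type codes 01 to 09 denote tumor samples and 10 to 19 normal samples.

```r
# Split TCGA barcodes into patient-level and sample-level parts.
# Assumes each barcode has at least the first three dash-separated fields.
parse_tcga_barcode <- function(barcodes) {
  parts <- strsplit(barcodes, "-", fixed = TRUE)

  # Patient barcode: project, tissue source site, participant
  patient <- vapply(parts, function(p) paste(p[1:3], collapse = "-"), character(1))

  # Fourth field holds the two-digit sample type plus a vial letter, e.g. "11A"
  sample_vial <- vapply(
    parts,
    function(p) if (length(p) >= 4) p[4] else NA_character_,
    character(1)
  )
  sample_type <- substr(sample_vial, 1, 2)

  data.frame(
    barcode = barcodes,
    patient = patient,
    sample_type = sample_type,
    vial = substr(sample_vial, 3, 3),
    is_tumor = as.integer(sample_type) < 10,  # 01-09 tumor, 10-19 normal
    stringsAsFactors = FALSE
  )
}

# Barcodes taken from the output shown earlier in this thread
parsed <- parse_tcga_barcode(c("TCGA-E9-A1NA-11A", "TCGA-AO-A12H-01A"))
print(parsed)
```

Grouping by the `patient` column is one way to spot patients with more than one sample, for example a tumor and a matched normal.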


You obtain it from the GDC Data Portal by selecting the samples that you need and then downloading the JSON file.

@kelvin, thanks for the reply. I have 350 samples, and I would like a more efficient method for all samples than manually downloading the metadata file. Also, is there a way to get the metadata file in TSV format, given the manifest data with the UUIDs?

Hey, yes, you can obtain a TSV file, too. Can you clarify what you are trying to convert, and to what you want to convert it?

OK, I have 320 AML aligned exon files (BAM or *_gdc_realn.bam files). In addition, I have their manifest data with UUID, filename, md5, size, and state. I would like to have their metadata as a TSV file. Currently, all I have is these BAM files and their manifest data. All I need is the metadata file with the following information:

study   center  tcga_id analysis_id accession   participant_id  sample_id   refassembly mark_duplicates exome_bed

For your study, there should be clinical data files that also provide a lot of information - these are available from the GDC, too. Filter for the BCR Biotab files.

There is likely a programmatic way, too, but I cannot think of one for now.

Thank you. I have downloaded the TSV files with the clinical information directly from the portal. However, I have no information for some columns, for example mark_duplicates, refassembly, etc.

I see. Information on those can likely be found in the SAM headers within each file. In addition, based on the overview of the DNA-seq analysis pipeline (HERE), it seems that PCR/optical duplicates are marked and that the ref assembly is GRCh38.d1.vd1

OK, I see, so such information should be generated directly from the BAM file. I thought I would have it separately as metadata. Thanks for the clarification.

Well, just always be meticulous with the TCGA data, i.e., introduce a lot of QC checking to ensure that you have the correct data... a lot of the TCGA was produced and has been duplicated and re-processed many times.
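
One simple QC check of this kind is to compare the md5 checksum of every downloaded file with the value in the manifest. The sketch below is an illustration rather than anything posted in the thread: the function name `check_md5` is made up, it assumes the `<id>/<filename>` folder layout described in the question, and the demo uses a small temporary file instead of a real archive.

```r
# Compare md5 checksums of downloaded files against the manifest.
# Assumes the layout <download_dir>/<id>/<filename> from the question.
check_md5 <- function(manifest, download_dir = ".") {
  paths <- file.path(download_dir, manifest$id, manifest$filename)
  observed <- unname(tools::md5sum(paths))  # NA when a file is missing
  data.frame(
    id = manifest$id,
    exists = file.exists(paths),
    md5_ok = !is.na(observed) & observed == manifest$md5,
    stringsAsFactors = FALSE
  )
}

# Demo with a throwaway file; "demo-id" and "missing-id" are placeholders
tmp <- tempfile("gdc_demo_")
dir.create(file.path(tmp, "demo-id"), recursive = TRUE)
demo_file <- file.path(tmp, "demo-id", "demo.tar.gz")
writeLines("not a real archive", demo_file)

demo_manifest <- data.frame(
  id = c("demo-id", "missing-id"),
  filename = c("demo.tar.gz", "absent.tar.gz"),
  md5 = c(unname(tools::md5sum(demo_file)), "0"),
  stringsAsFactors = FALSE
)

qc <- check_md5(demo_manifest, tmp)
print(qc)
```

Rows with `md5_ok` equal to FALSE point to files that are missing or whose contents do not match the manifest and should be downloaded again.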