Question: Sample names for TCGA data from GDC-legacy archive
3
15 months ago by
Vasu340
Vasu340 wrote:

Hi,

Since I needed raw RNA-seq sequencing data, I downloaded the RNA-seq manifest file from the GDC Legacy Archive and then used my token to download the raw data.

The manifest looks like this:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

After the download, I have folders named after the values in the "id" column. Inside each folder there is a tar.gz file.

For example:

d1017f74-3a39-4427-af57-273e34247b49
                       |___ UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz

When I extracted the tar.gz files I got the fastq files like below:

110801_UNC12-SN629_0115_BD0DVEABXX.3_1.fastq
110801_UNC12-SN629_0115_BD0DVEABXX.3_2.fastq

What is the sample name here? It looks very confusing.
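
To keep track of which extracted FASTQ files belong to which manifest row, one option is to split each tarball name on its dots. This is only a sketch based on the pattern visible in the manifest above (a prefix, a UUID, then a run/flowcell/lane string); the helper name `parse_tarball_name` is hypothetical, and the run string is not itself a TCGA sample name.

```r
# Hypothetical helper: split a legacy-archive tarball name into its visible parts.
# Assumes the "UNCID_<n>.<uuid>.<run string>.tar.gz" pattern seen in the manifest above.
parse_tarball_name <- function(filename) {
  base  <- sub("\\.tar\\.gz$", "", filename)
  parts <- strsplit(base, ".", fixed = TRUE)[[1]]
  data.frame(
    prefix     = parts[1],
    uuid       = parts[2],
    run_string = parts[3],
    stringsAsFactors = FALSE
  )
}

# One manifest row, copied from the question
manifest <- data.frame(
  id = "d1017f74-3a39-4427-af57-273e34247b49",
  filename = "UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz",
  stringsAsFactors = FALSE
)
parsed   <- do.call(rbind, lapply(manifest$filename, parse_tarball_name))
manifest <- cbind(manifest, parsed)

# Strip the ".<lane>_<read>.fastq" suffix from an extracted FASTQ name and
# check that the remainder is the start of the tarball's run string
fastq        <- "110801_UNC12-SN629_0115_BD0DVEABXX.3_1.fastq"
flowcell_run <- sub("\\.[0-9]+_[12]\\.fastq$", "", fastq)
startsWith(manifest$run_string, flowcell_run)
```

This links each FASTQ pair back to the file UUID in the "id" column, which is the key needed for any barcode lookup.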

rna-seq gdc tcga • 4.1k views
ADD COMMENTlink modified 4 months ago by david.peeney20 • written 15 months ago by Vasu340
2

As this is controlled data, you could log in at the main GDC ( https://portal.gdc.cancer.gov/ ) and use the search box to search for the file-names - they should be there. You would then obviously look for the UUID or TCGA barcode.

To do this programmatically, there are APIs, but the last time I tried them they were offline. A lot of TCGA data has been moved around relatively recently. One way that I did it was to download the JSON manifest for my data (from the Legacy Archive) and then use a loop in R to pull out the case ID (a UUID), which I then used to identify the patients. Here's the loop that I used (slow; the sample filenames are in the filenames object):

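library(rjson)
manifest <- fromJSON(file="RNAseqManifest.json")

# Look up each filename's case UUID from the manifest
fileUUIDs <- character(length(filenames))
for (i in seq_along(filenames))
{
    # Take the first manifest record whose text contains this filename
    record <- manifest[[grep(filenames[i], manifest, fixed=TRUE)[1]]]

    # Sanity check: warn if the matched record belongs to a different file
    if (filenames[i]!=record$file_name)
    {
        warning("Filename mismatch for ", filenames[i])
    }

    fileUUIDs[i] <- record$cases[[1]]$case_id
}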
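
As a possible faster alternative to the loop above, the lookup can be sketched with `vapply()` and `match()`. The manifest below is a toy stand-in with made-up values; it only assumes the two fields the loop already uses (`file_name` and `cases[[1]]$case_id`).

```r
# Toy stand-in for the parsed JSON manifest: a list of records, each with
# file_name and cases[[1]]$case_id. All values are made up for illustration.
manifest <- list(
  list(file_name = "a.tar.gz", cases = list(list(case_id = "case-uuid-1"))),
  list(file_name = "b.tar.gz", cases = list(list(case_id = "case-uuid-2")))
)
filenames <- c("b.tar.gz", "a.tar.gz", "missing.tar.gz")

# Pull both fields out of every record once
manifest_names <- vapply(manifest, function(r) r$file_name, character(1))
manifest_cases <- vapply(manifest, function(r) r$cases[[1]]$case_id, character(1))

# match() gives NA for filenames that are absent from the manifest
caseUUIDs <- manifest_cases[match(filenames, manifest_names)]
names(caseUUIDs) <- filenames
caseUUIDs
```

Matching on the exact filename avoids scanning the whole manifest for every file, and files missing from the manifest show up as NA instead of stopping the run.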
ADD REPLYlink modified 14 months ago • written 15 months ago by Kevin Blighe45k

Sorry, I didn't understand what the filenames object is. What are the sample filenames?

ADD REPLYlink written 15 months ago by Vasu340

Just a vector of your filenames, such as:

filenames <- c("UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz",
  ...,
  "UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz")
ADD REPLYlink modified 8 months ago • written 15 months ago by Kevin Blighe45k

As shown in the manifest above, I already have the UUID. What I need is the TCGA sample name. Do I need to log in to the GDC and search for this?

ADD REPLYlink written 15 months ago by Vasu340

I see. For UUID-to-TCGA barcode mapping, I was able to just use one of the clinical data files in BioTab format (also available at Legacy Archive).

For example, here is the file for breast cancer: https://portal.gdc.cancer.gov/legacy-archive/files/735bc5...

The first 2 columns of that file are:

  • bcr_patient_uuid
  • bcr_patient_barcode
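
The lookup against those two columns can be sketched as follows. The data frames are toy stand-ins with made-up values; a real BioTab file would be read with `read.delim()` and may carry extra header rows that need skipping, which is worth checking per file.

```r
# Toy stand-in for the first two columns of a BioTab clinical file
clinical <- data.frame(
  bcr_patient_uuid    = c("UUID-AAAA", "UUID-BBBB"),
  bcr_patient_barcode = c("TCGA-XX-0001", "TCGA-XX-0002"),
  stringsAsFactors = FALSE
)

# Case UUIDs pulled from the manifest (made-up values)
caseUUIDs <- c("uuid-bbbb", "uuid-aaaa", "uuid-cccc")

# Compare case-insensitively, to be safe if the two sources differ in letter case
idx <- match(tolower(caseUUIDs), tolower(clinical$bcr_patient_uuid))
mapping <- data.frame(
  case_uuid       = caseUUIDs,
  patient_barcode = clinical$bcr_patient_barcode[idx],
  stringsAsFactors = FALSE
)
mapping
```

UUIDs without a clinical record end up with NA in `patient_barcode`, so they are easy to spot.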
ADD REPLYlink modified 8 months ago • written 15 months ago by Kevin Blighe45k
5
15 months ago by
Vasu340
Vasu340 wrote:

Best way to do it.

library(GenomicDataCommons)
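# header = TRUE is needed so that manifest$id refers to the "id" column
manifest <- read.table("gdc_manifest_rnaseq_fastq.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)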

manifest:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

file_uuids <- manifest$id
file_uuids

d1017f74-3a39-4427-af57-273e34247b49
5e2d5c52-596f-49bc-967c-42129abbacbf
2ef74f93-5da2-454c-aca2-d86c289eacb8
e01ca3e0-beb0-46b7-bb7c-f5b16f966918
992a7083-28ce-4857-898e-9d4b4fbf2fa1
230082b7-39ec-4fe1-b3c6-daf35458f396
9bbada51-d827-4eea-af45-47d7b5ba137e
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2

library(GenomicDataCommons)
library(magrittr)

TCGAtranslateID = function(file_ids, legacy = TRUE) {
    info = files(legacy = legacy) %>%
        filter( ~ file_id %in% file_ids) %>%
        select('cases.samples.submitter_id') %>%
        results_all()
    # The mess of code below is to extract TCGA barcodes
    # id_list will contain a list (one item for each file_id)
    # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
    id_list = lapply(info$cases,function(a) {
        a[[1]][[1]][[1]]})
    # so we can later expand to a data.frame of the right size
    barcodes_per_file = sapply(id_list,length)
    # And build the data.frame
    return(data.frame(file_id = rep(ids(info),barcodes_per_file),
                      submitter_id = unlist(id_list)))
    }

res = TCGAtranslateID(file_uuids)
head(res)

file_id                                 submitter_id
d1017f74-3a39-4427-af57-273e34247b49    TCGA-E9-A1NA-11A
5e2d5c52-596f-49bc-967c-42129abbacbf    TCGA-AO-A12H-01A
2ef74f93-5da2-454c-aca2-d86c289eacb8    TCGA-AC-A23E-01A

I found the solution on Sean Davis's blog: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/
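
Once the barcodes are available, splitting them into parts helps to tell tumour from normal samples. A minimal sketch, assuming the standard TCGA sample barcode layout (TCGA, tissue source site, participant, then a two-digit sample type plus a vial letter); the helper name `parse_sample_barcode` is hypothetical.

```r
# Hypothetical helper: split sample-level TCGA barcodes such as "TCGA-E9-A1NA-11A"
parse_sample_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)
  data.frame(
    barcode     = barcode,
    # First three fields identify the patient
    patient     = vapply(parts, function(p) paste(p[1:3], collapse = "-"), character(1)),
    # Fourth field: two-digit sample type code followed by the vial letter
    sample_code = vapply(parts, function(p) substr(p[4], 1, 2), character(1)),
    vial        = vapply(parts, function(p) substr(p[4], 3, 3), character(1)),
    stringsAsFactors = FALSE
  )
}

res <- parse_sample_barcode(c("TCGA-E9-A1NA-11A", "TCGA-AO-A12H-01A", "TCGA-AC-A23E-01A"))

# Sample type codes 01-09 denote tumour samples, 10-19 normal samples
code <- as.integer(res$sample_code)
res$is_normal <- code >= 10 & code <= 19
res
```

In the output above, the first file (sample type 11) is a normal sample, while the other two (sample type 01) are primary tumours.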

ADD COMMENTlink modified 8 months ago • written 15 months ago by Vasu340
3

For the record, Bioinfo's answer is Sean Davis's blog post which can be seen here: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/

If you are using content that is not yours, please cite it.

Also, we've added this functionality with Sean's permission to the TCGAutils package on Bioconductor.

Best regards, Marcel

ADD REPLYlink written 8 months ago by mramos14830

Thank you for catching that, Marcel.

ADD REPLYlink modified 8 months ago • written 8 months ago by Kevin Blighe45k

I just posted the answer showing how I solved the problem. Yes, I should have posted the link; I completely forgot about it while posting the solution. Sorry for that.

ADD REPLYlink modified 8 months ago • written 8 months ago by Vasu340
1

Yes, you most definitely should have posted the link especially if the content was directly taken from the blog. You may not have intended to omit the attribution, but that omission makes you look bad. Gotta be extra careful, unfortunately.

ADD REPLYlink written 8 months ago by RamRS22k

Sure, I will be more careful next time. Thank you.

ADD REPLYlink written 8 months ago by Vasu340
1

That's pretty cool - I've moved this to an answer. Please feel free to Accept it, as this will help others.

ADD REPLYlink written 15 months ago by Kevin Blighe45k

However, I received the following error in RStudio:

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"

ADD REPLYlink written 13 months ago by Björn40

It should be "filter", not "filter_"

ADD REPLYlink written 13 months ago by Vasu340

Bioinfo, this exact function was working 2 weeks ago. Something has changed, either at the GDC server or in the version of the GenomicDataCommons package (or elsewhere).

I tried it yesterday and it did not work either, unfortunately (I tried from 2 different places).

ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe45k

I just tried the code above again, and it worked for me.

ADD REPLYlink written 13 months ago by Vasu340

I tried it just now and it now gives this:

Error in .gdc_post(entity_name(x), body = body, legacy = x$legacy, token = NULL,  :
  Not Found (HTTP 404).
In addition: Warning message:
In strptime(x, fmt, tz = "GMT") :
  unknown timezone 'zone/tz/2018c.1.0/zoneinfo/Europe/London'
ADD REPLYlink written 13 months ago by Kevin Blighe45k

It could be due to the versions. Not sure. These are the versions I'm using.

GenomicDataCommons_1.5.3
BiocInstaller_1.31.1
R version 3.5.0
ADD REPLYlink written 13 months ago by Vasu340
2

I modified the function to accept filenames, too: C: problem in matching the names between file names and patients Id in TCGA

Thanks.

ADD REPLYlink modified 8 months ago • written 13 months ago by Kevin Blighe45k
2

Also got the function working again by updating directly from GitHub:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
ADD REPLYlink written 13 months ago by Kevin Blighe45k
1

Hi,

I tried your code but I am getting this error: Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: gdc-api.nci.nih.gov

Any suggestion or help is much appreciated.

Thanks

ADD REPLYlink written 10 months ago by archana.bioinfo87160
1

Try to install the development version of GenomicDataCommons:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
ADD REPLYlink written 10 months ago by Kevin Blighe45k
1
4 months ago by
david.peeney20
david.peeney20 wrote:

Copying my answer from Tutorial: TCGA UUIDS to TCGA barcode (SampleID) in R

Using the GenomicDataCommons method only gives you short barcodes (for identifying patients), which is not particularly useful when dealing with duplicate samples. A good method I found that gives me harmonized UUIDs and full aliquot barcodes is:

Firstly, you need to download the JSON manifest files from your selected study and file types from the GDC legacy archive (NOT GDC portal).

Then, use the following R script:

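library(dplyr)
library(jsonlite)
# Parse the JSON metadata file downloaded from the GDC Legacy Archive
legacy = fromJSON(txt = "~/Downloads/metadata.cart.2019-03-07 (1).json")
# File UUIDs, one per record
legfnames = legacy[["file_id"]]
# One data frame of associated entities per file
entities = legacy[["associated_entities"]]
# Stack them; column_label records which file each row came from
IDconversion = bind_rows(entities, .id = "column_label")
IDconversion['legacy file names'] = legfnames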
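
The same flattening can be sketched in base R, which also copes with files that have more than one associated entity. The records below are a toy stand-in for the JSON parsed with `simplifyVector = FALSE`; the field names `entity_id` and `entity_submitter_id` are assumptions about the metadata layout and should be checked against your own file, and all values are made up.

```r
# Toy stand-in for jsonlite::fromJSON(<metadata file>, simplifyVector = FALSE)
records <- list(
  list(file_id = "file-uuid-1",
       associated_entities = list(
         list(entity_id = "aliquot-uuid-1", entity_submitter_id = "BARCODE-1"))),
  list(file_id = "file-uuid-2",
       associated_entities = list(
         list(entity_id = "aliquot-uuid-2", entity_submitter_id = "BARCODE-2"),
         list(entity_id = "aliquot-uuid-3", entity_submitter_id = "BARCODE-3")))
)

# Build one row per (file, entity) pair; file_id is recycled across a file's entities
rows <- lapply(records, function(rec) {
  ents <- rec$associated_entities
  data.frame(
    file_id             = rec$file_id,
    entity_id           = vapply(ents, function(e) e$entity_id, character(1)),
    entity_submitter_id = vapply(ents, function(e) e$entity_submitter_id, character(1)),
    stringsAsFactors = FALSE
  )
})
IDconversion <- do.call(rbind, rows)
IDconversion
```

Keeping `file_id` inside each per-file data frame means the file-to-barcode link survives even when the number of entities differs between files.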
ADD COMMENTlink written 4 months ago by david.peeney20

@david.peeney, How and where did you download the metadata.cart* from GDC?

ADD REPLYlink written 7 days ago by a.james210

You obtain it from the GDC Data Portal by selecting the samples that you need and then downloading the JSON file.

ADD REPLYlink written 7 days ago by Kevin Blighe45k

@Kevin, thanks for the reply. I have 350 samples, and I would like a more efficient method for all samples than manually downloading the metadata file. Also, is there a way to get the metadata file in TSV format, given the manifest data with the UUIDs?

ADD REPLYlink modified 7 days ago • written 7 days ago by a.james210

Hey, yes, you can obtain a TSV file, too. Can you clarify what you are trying to convert, and to what you want to convert it?

ADD REPLYlink written 7 days ago by Kevin Blighe45k

OK, I have 320 AML aligned exon files (BAM or *_gdc_realn.bam files). In addition, I have their manifest data with UUID, filename, md5, size, and state. I would like to have their metadata as a TSV file. Currently, all I have is these BAM files and their manifest data. All I need is the metadata file with the following information:

study   center  tcga_id analysis_id accession   participant_id  sample_id   refassembly mark_duplicates exome_bed
ADD REPLYlink modified 7 days ago • written 7 days ago by a.james210

For your study, there should be clinical data files that also provide a lot of information - these are available from the GDC, too. Filter for the BCR Biotab files.

There is likely a programmatic way, too, but I cannot think of one for now.
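
For the TSV part of the question, one possible sketch is to merge the manifest with whatever ID table has already been obtained and write the result with `write.table()`. Both data frames below are toy stand-ins with made-up values; the `ids` table is assumed to come from a file UUID to barcode translation such as the one discussed earlier in this thread.

```r
# Toy manifest (same columns as a GDC manifest, trimmed to two)
manifest <- data.frame(
  id       = c("file-uuid-1", "file-uuid-2"),
  filename = c("a_gdc_realn.bam", "b_gdc_realn.bam"),
  stringsAsFactors = FALSE
)

# Made-up mapping from file UUID to barcode
ids <- data.frame(
  file_id      = c("file-uuid-2", "file-uuid-1"),
  submitter_id = c("BARCODE-2", "BARCODE-1"),
  stringsAsFactors = FALSE
)

# Left join: keep every manifest row, even if no barcode was found for it
metadata <- merge(manifest, ids, by.x = "id", by.y = "file_id", all.x = TRUE)

# Write a tab-separated file; a temporary path is used here for illustration
out <- tempfile(fileext = ".tsv")
write.table(metadata, out, sep = "\t", quote = FALSE, row.names = FALSE)
```

Further columns can be added to `metadata` in the same way before writing, one merge per source table.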

ADD REPLYlink written 7 days ago by Kevin Blighe45k

Thank you. I have downloaded the TSV files with the clinical information directly from the portal. However, they contain no information for some of the columns, for example mark_duplicates and refassembly.

ADD REPLYlink written 7 days ago by a.james210

I see. Information on those can likely be found in the SAM headers within each file. In addition, based on the overview of the DNA-seq analysis pipeline (HERE), it seems that PCR/optical duplicates are marked and that the ref assembly is GRCh38.d1.vd1
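
Checking the header can be sketched in base R once the header lines are available as a character vector, for example by capturing the output of `samtools view -H`. The header below is made up for illustration; whether a given BAM carries an `AS` tag on its `@SQ` lines or a MarkDuplicates `@PG` entry is an assumption to verify on the real files, and `get_tag` is a hypothetical helper.

```r
# Made-up header lines in SAM format (tab-separated TAG:VALUE fields)
header <- c(
  "@HD\tVN:1.5\tSO:coordinate",
  "@SQ\tSN:chr1\tLN:248956422\tAS:GRCh38.d1.vd1",
  "@PG\tID:bwa\tPN:bwa",
  "@PG\tID:MarkDuplicates\tPN:MarkDuplicates"
)

# Hypothetical helper: unique values of one tag across header lines of one record type
get_tag <- function(lines, record, tag) {
  fields <- strsplit(lines[startsWith(lines, record)], "\t", fixed = TRUE)
  prefix <- paste0(tag, ":")
  vals <- lapply(fields, function(f) {
    hit <- f[startsWith(f, prefix)]
    substring(hit, nchar(prefix) + 1)
  })
  unique(unlist(vals))
}

# Assembly name from the @SQ lines, and whether any @PG entry mentions MarkDuplicates
refassembly     <- get_tag(header, "@SQ", "AS")
mark_duplicates <- any(grepl("MarkDuplicates", get_tag(header, "@PG", "ID"), fixed = TRUE))
refassembly
mark_duplicates
```

Running this per BAM gives one value for each of the refassembly and mark_duplicates columns.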

ADD REPLYlink written 7 days ago by Kevin Blighe45k
1

OK, I see, so such information should be generated directly from the BAM file. I thought I would have it separately as metadata. Thanks for the clarification.

ADD REPLYlink written 7 days ago by a.james210
1

Well, just always be meticulous with the TCGA data, i.e., introduce a lot of QC checking to ensure that you have the correct data... a lot of the TCGA was produced and has been duplicated and re-processed many times.

ADD REPLYlink written 7 days ago by Kevin Blighe45k