Question: Sample names for TCGA data from GDC-legacy archive
8 months ago by
Bioinfo270
Bioinfo270 wrote:

Hi,

I needed raw RNA-seq sequencing data, so I downloaded the RNA-seq manifest file from the GDC Legacy Archive and, using my access token, downloaded the raw data.

The manifest looks like this:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

After the download I have folders named after the values in the "id" column. Inside each folder there is a tar.gz file.

For example:

d1017f74-3a39-4427-af57-273e34247b49
                       |___ UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz

When I extracted the tar.gz files I got the fastq files like below:

110801_UNC12-SN629_0115_BD0DVEABXX.3_1.fastq
110801_UNC12-SN629_0115_BD0DVEABXX.3_2.fastq

What is the sample name here? It looks very confusing.

rna-seq gdc tcga • 2.0k views
ADD COMMENTlink modified 8 months ago by Kevin Blighe33k • written 8 months ago by Bioinfo270

As this is controlled data, you could log in at the main GDC portal ( https://portal.gdc.cancer.gov/ ) and use the search box to search for the filenames; they should be there. You would then look for the UUID or TCGA barcode.

To do this programmatically, there are APIs, but they were offline the last time I tried them. A lot of TCGA data has been moved around relatively recently. One way I did it was to download the JSON manifest for my data (from the Legacy Archive) and then use a loop in R to pull out the case ID (a UUID) for each file, which I then used to identify the patients. Here's the loop that I used (slow; the sample filenames are in the filenames object):

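library(rjson)
manifest <- fromJSON(file="RNAseqManifest.json")

# Look up the case UUID for each filename in the manifest
fileUUIDs <- character(length(filenames))
for (i in seq_along(filenames))
{
    # grep() coerces each manifest record to a string and returns the index of the matching record
    record <- manifest[[grep(filenames[i], manifest, fixed=TRUE)]]

    # Sanity check: the matched record should carry exactly this filename
    if (filenames[i] != record$file_name)
    {
        warning("Filename mismatch for ", filenames[i])
    }

    # Despite the variable name, this stores the case UUID, not the file UUID
    fileUUIDs[i] <- record$cases[[1]]$case_id
}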
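
A possibly faster alternative to the loop above is to build a filename-to-case lookup once and then index into it. This is only a sketch: it assumes each manifest record has a file_name field and a cases list whose first element holds case_id, as the loop implies. The toy manifest, filenames, and IDs below are made up for illustration.

```r
# Toy stand-in for fromJSON(file = "RNAseqManifest.json"); all values are made up
manifest <- list(
  list(file_name = "sampleA.tar.gz", cases = list(list(case_id = "case-uuid-1"))),
  list(file_name = "sampleB.tar.gz", cases = list(list(case_id = "case-uuid-2")))
)
filenames <- c("sampleB.tar.gz", "sampleA.tar.gz", "missing.tar.gz")

# Build a named vector once: names are filenames, values are case UUIDs
case_ids <- vapply(manifest, function(rec) rec$cases[[1]]$case_id, character(1))
names(case_ids) <- vapply(manifest, function(rec) rec$file_name, character(1))

# Exact-name lookup; filenames absent from the manifest come back as NA
fileUUIDs <- unname(case_ids[filenames])
print(data.frame(filename = filenames, case_id = fileUUIDs))
```

Unlike grep(), this matches whole filenames only, so a filename that is a substring of another cannot produce a false hit.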
ADD REPLYlink modified 7 months ago • written 8 months ago by Kevin Blighe33k

Sorry, I didn't understand what the filenames object is. What are the sample filenames?

ADD REPLYlink written 8 months ago by Bioinfo270

Just a vector of your filenames, such as:

filenames <- c("UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz",
  ...,
  "UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz")
ADD REPLYlink modified 6 weeks ago • written 8 months ago by Kevin Blighe33k

As shown in the manifest above, I already have the UUID. What I need is the TCGA sample name. Do I need to log in to the GDC and search for this?

ADD REPLYlink written 8 months ago by Bioinfo270

I see. For UUID-to-TCGA barcode mapping, I was able to just use one of the clinical data files in BioTab format (also available at the Legacy Archive).

For example, here is the file for breast cancer: https://portal.gdc.cancer.gov/legacy-archive/files/735bc5...

The first 2 columns of that file are:

  • bcr_patient_uuid
  • bcr_patient_barcode
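
The mapping from those two columns can be sketched as follows. This is a minimal example under assumptions: the table is tab-delimited with a single header row (real BioTab files may carry extra header rows, so inspect the file first), and every UUID and barcode below is made up.

```r
# Toy stand-in for a BioTab clinical file; UUIDs and barcodes are made up
biotab_text <- paste(
  "bcr_patient_uuid\tbcr_patient_barcode",
  "00000000-0000-0000-0000-000000000001\tTCGA-XX-0001",
  "00000000-0000-0000-0000-000000000002\tTCGA-XX-0002",
  sep = "\n"
)
clinical <- read.delim(text = biotab_text, stringsAsFactors = FALSE)

# Case UUIDs to translate (for example, the output of the manifest loop)
case_uuids <- c("00000000-0000-0000-0000-000000000002",
                "00000000-0000-0000-0000-000000000001")

# match() finds the row of each UUID; tolower() guards against case differences
idx <- match(tolower(case_uuids), tolower(clinical$bcr_patient_uuid))
barcodes <- clinical$bcr_patient_barcode[idx]
print(data.frame(case_uuid = case_uuids, barcode = barcodes))
```

For a real file, replace the text argument with the path to the downloaded BioTab file.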
ADD REPLYlink modified 6 weeks ago • written 8 months ago by Kevin Blighe33k
8 months ago by
Bioinfo270
Bioinfo270 wrote:

Best way to do it.

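library(GenomicDataCommons)
# The manifest is tab-delimited with a header row; header = TRUE is needed for manifest$id to work
manifest <- read.table("gdc_manifest_rnaseq_fastq.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)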

manifest:

id  filename    md5 size    state
d1017f74-3a39-4427-af57-273e34247b49    UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz   ed7f23aa9540ef0242cb6ddde30d1aca    5830465428  live
5e2d5c52-596f-49bc-967c-42129abbacbf    UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz  b1f03852b2ac3c3cd50cb4a87f2a116a    7587398372  live
2ef74f93-5da2-454c-aca2-d86c289eacb8    UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz   a965da78ada814a35702fd65209b500a    7867889236  live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918    UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz   6e6a26fcce8e84d209b1475249a922de    5187498148  live
992a7083-28ce-4857-898e-9d4b4fbf2fa1    UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz   bb9e19a5f286ff37bf95cb0c307930ea    6717741168  live
230082b7-39ec-4fe1-b3c6-daf35458f396    UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz   d93777efebc921e2539aa2b7081da6d4    4766879929  live
9bbada51-d827-4eea-af45-47d7b5ba137e    UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz   7bcfe71256ba172fa605bc4ddc04f9c7    7606309309  live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2    UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz   281083ac338f67145ede3d8ef3f4300f    7504345358  live

file_uuids <- manifest$id
file_uuids

d1017f74-3a39-4427-af57-273e34247b49
5e2d5c52-596f-49bc-967c-42129abbacbf
2ef74f93-5da2-454c-aca2-d86c289eacb8
e01ca3e0-beb0-46b7-bb7c-f5b16f966918
992a7083-28ce-4857-898e-9d4b4fbf2fa1
230082b7-39ec-4fe1-b3c6-daf35458f396
9bbada51-d827-4eea-af45-47d7b5ba137e
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2
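
As a small offline check of the manifest-reading step, the sketch below parses two rows copied from the manifest above. It assumes the file is tab-delimited with a header row, and it shows that the id column is only available by name when the header is read.

```r
# Two rows copied from the manifest above, joined with tabs
manifest_text <- paste(
  "id\tfilename\tmd5\tsize\tstate",
  paste("d1017f74-3a39-4427-af57-273e34247b49",
        "UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz",
        "ed7f23aa9540ef0242cb6ddde30d1aca", "5830465428", "live", sep = "\t"),
  paste("5e2d5c52-596f-49bc-967c-42129abbacbf",
        "UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz",
        "b1f03852b2ac3c3cd50cb4a87f2a116a", "7587398372", "live", sep = "\t"),
  sep = "\n"
)

# With the header row read, columns are addressable by name
manifest <- read.delim(text = manifest_text, stringsAsFactors = FALSE)
file_uuids <- manifest$id
print(file_uuids)

# Without it, columns are named V1..V5 and the id column is not found
no_header <- read.delim(text = manifest_text, header = FALSE, stringsAsFactors = FALSE)
print(is.null(no_header$id))
```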

library(GenomicDataCommons)
library(magrittr)

TCGAtranslateID = function(file_ids, legacy = TRUE) {
    info = files(legacy = legacy) %>%
        filter( ~ file_id %in% file_ids) %>%
        select('cases.samples.submitter_id') %>%
        results_all()
    # The mess of code below is to extract TCGA barcodes
    # id_list will contain a list (one item for each file_id)
    # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
    id_list = lapply(info$cases,function(a) {
        a[[1]][[1]][[1]]})
    # so we can later expand to a data.frame of the right size
    barcodes_per_file = sapply(id_list,length)
    # And build the data.frame
    return(data.frame(file_id = rep(ids(info),barcodes_per_file),
                      submitter_id = unlist(id_list)))
    }

res = TCGAtranslateID(file_uuids)
head(res)

file_id                                 submitter_id
d1017f74-3a39-4427-af57-273e34247b49    TCGA-E9-A1NA-11A
5e2d5c52-596f-49bc-967c-42129abbacbf    TCGA-AO-A12H-01A
2ef74f93-5da2-454c-aca2-d86c289eacb8    TCGA-AC-A23E-01A
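
Once the barcodes are available, they can be split into a patient identifier and a sample type. The sketch below is illustrative rather than part of the original function; it relies on the TCGA barcode convention that the two digits in the fourth field are the sample type code, with 01 to 09 for tumor samples and 10 to 19 for normal samples. The column names are my own.

```r
# Barcodes taken from the output above
barcodes <- c("TCGA-E9-A1NA-11A", "TCGA-AO-A12H-01A", "TCGA-AC-A23E-01A")
parts <- strsplit(barcodes, "-", fixed = TRUE)

parsed <- data.frame(
  barcode = barcodes,
  # The first three fields identify the patient
  patient = vapply(parts, function(p) paste(p[1:3], collapse = "-"), character(1)),
  # The first two characters of the fourth field are the sample type code
  sample_type_code = vapply(parts, function(p) substr(p[4], 1, 2), character(1)),
  stringsAsFactors = FALSE
)

# 01-09 are tumor samples, 10-19 are normal samples; anything else is labeled "other"
code_num <- as.integer(parsed$sample_type_code)
parsed$tissue <- ifelse(code_num < 10, "tumor",
                        ifelse(code_num < 20, "normal", "other"))
print(parsed)
```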

I found the solution on Sean Davis's blog: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/

ADD COMMENTlink modified 5 weeks ago • written 8 months ago by Bioinfo270

For the record, Bioinfo's answer is Sean Davis's blog post which can be seen here: https://seandavi.github.io/post/2017/12/genomicdatacommons-example-uuid-to-tcga-and-target-barcode-translation/

If you are using content that is not yours, please cite it.

Also, we've added this functionality with Sean's permission to the TCGAutils package on Bioconductor.

Best regards, Marcel

ADD REPLYlink written 5 weeks ago by mramos14830

Thank you for catching that, Marcel.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe33k

I just posted how I solved the problem. Yes, I should have posted the link. I completely forgot about it while posting the solution. Sorry about that.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Bioinfo270

Yes, you most definitely should have posted the link especially if the content was directly taken from the blog. You may not have intended to omit the attribution, but that omission makes you look bad. Gotta be extra careful, unfortunately.

ADD REPLYlink written 5 weeks ago by RamRS19k

Sure, I will be more careful next time. Thank you.

ADD REPLYlink written 5 weeks ago by Bioinfo270

That's pretty cool - I've moved this to an answer. Please feel free to Accept it, as this will help others.

ADD REPLYlink written 8 months ago by Kevin Blighe33k

However, I received the following error in RStudio:

Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"

ADD REPLYlink written 6 months ago by Björn30

It should be "filter", not "filter_"

ADD REPLYlink written 6 months ago by Bioinfo270

Bioinfo, this exact function was working 2 weeks ago. Something has changed, either at the GDC server or in the version of the GenomicDataCommons package (or elsewhere).

I tried it yesterday and it did not work either, unfortunately (I tried from 2 different places).

ADD REPLYlink modified 6 months ago • written 6 months ago by Kevin Blighe33k

I just tried the code above, and it worked for me.

ADD REPLYlink written 6 months ago by Bioinfo270

I tried it just now and it now gives this:

Error in .gdc_post(entity_name(x), body = body, legacy = x$legacy, token = NULL,  :
  Not Found (HTTP 404).
In addition: Warning message:
In strptime(x, fmt, tz = "GMT") :
  unknown timezone 'zone/tz/2018c.1.0/zoneinfo/Europe/London'
ADD REPLYlink written 6 months ago by Kevin Blighe33k

It could be due to the versions. Not sure. These are the versions I'm using.

GenomicDataCommons_1.5.3
BiocInstaller_1.31.1
R version 3.5.0
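
When comparing setups like this, the versions can be collected with a short snippet instead of being copied by hand. This is a sketch; the package list is only an example, and packages that are not installed are reported as NA.

```r
# Packages whose versions we want to compare across machines (example list)
pkgs <- c("GenomicDataCommons", "magrittr")

# Installed version of each package, or NA if it is not installed
installed <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))

print(R.version.string)
print(installed)
```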
ADD REPLYlink written 6 months ago by Bioinfo270

I modified the function to accept filenames, too; see my comment in the thread "problem in matching the names between file names and patients Id in TCGA".

Thanks.

ADD REPLYlink modified 5 weeks ago • written 6 months ago by Kevin Blighe33k

Also got the function working again by updating directly from GitHub:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
ADD REPLYlink written 6 months ago by Kevin Blighe33k

Hi,

I tried your code but I am getting this error:

Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: gdc-api.nci.nih.gov

Any suggestion or help is much appreciated.

Thanks

ADD REPLYlink written 3 months ago by archana.bioinfo87100

Try to install the development version of GenomicDataCommons:

require(devtools)
install_github("Bioconductor/GenomicDataCommons")
ADD REPLYlink written 3 months ago by Kevin Blighe33k