Question: GDC portal barcode / metadata
2.2 years ago by
silvia.casola80 wrote:


I'm completely new to the portal and I need some help (I saw there are similar questions on this forum).

What I need to do is to download Gene Expression quantification data (using HTSeq-FPKM-UQ) for breast cancer and use these data to classify cancer subtypes (luminal A, B, HER2-like, basal-like).

To retrieve the labels I basically have 2 options (feel free to add more):

1) Get the sample ID in the 'old' TCGA-barcode format (e.g. "TCGA-AR-A1AL-01") and use a dictionary, downloaded from an older article that used the same data, which maps each barcode directly to a subtype. The problem here is that I have no idea how to get the IDs in TCGA-barcode format, and it looks like the old API for doing that no longer works.

2) Also download the clinical data and check the fields linked to ER, PgR and HER2 to assign labels manually. However, once I download the EXP data, I basically lose all metadata, and I don't know how to join the two files (EXP, clinical) in order to assign labels. I know there must be a way to do what I need using the API.

Can someone with more experience with the portal help me?

Thank you :)

rna-seq gdc tcga • 1.2k views
2.2 years ago by
University of Maryland
noorpratap.singh300 wrote:

If you are familiar with R, then things are really easy. There is a Bioconductor package called TCGAbiolinks. For your case, examples 3 and 4 are useful. Make sure to use legacy = T, since I am not sure whether the subtypes exist for the updated data.

If you are uncomfortable with R, then you have to download the two data sets (clinical and mRNA) separately, making sure that the metadata file for each one is downloaded along with it. Then you need to write a script that matches the barcodes from both. A typical barcode looks like this: TCGA-G4-6317-02A-11D-2064-05. If you split it on '-', the first three fields (TCGA-G4-6317) are sufficient to identify the patient; the fourth field (02A here) encodes the sample type and vial, so a sample-level ID like "TCGA-AR-A1AL-01" is the first four fields without the vial letter. Extract the barcodes from both data sets and compare the first three '-'-delimited fields, as in the example above. This lets you map samples between the two data sets. Hope this helps.
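The matching step described above could look roughly like the following in Python. This is only a minimal sketch: the function names, the second barcode and the clinical values are made up for illustration, and it assumes you have already extracted the barcodes from the expression metadata and loaded the clinical records into a dictionary keyed by patient ID.

```python
def patient_id(barcode: str) -> str:
    """Return the first three '-'-delimited fields, e.g. 'TCGA-G4-6317'."""
    parts = barcode.split("-")
    if len(parts) < 3:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return "-".join(parts[:3])


def sample_id(barcode: str) -> str:
    """Return patient ID plus the two-digit sample type, e.g. 'TCGA-G4-6317-02'."""
    parts = barcode.split("-")
    if len(parts) < 4:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    # The fourth field is sample type (two digits) plus an optional vial letter.
    return "-".join(parts[:3] + [parts[3][:2]])


def join_by_patient(expression_barcodes, clinical_by_patient):
    """Map each expression barcode to the clinical record of its patient.

    Barcodes whose patient has no clinical record are skipped.
    """
    matched = {}
    for barcode in expression_barcodes:
        pid = patient_id(barcode)
        if pid in clinical_by_patient:
            matched[barcode] = clinical_by_patient[pid]
    return matched


# Hypothetical inputs, for illustration only (not real clinical values).
expression_barcodes = [
    "TCGA-G4-6317-02A-11D-2064-05",
    "TCGA-XX-0000-01A-11R-0000-07",  # made-up barcode with no clinical record
]
clinical_by_patient = {
    "TCGA-G4-6317": {"er_status": "placeholder"},
}

matched = join_by_patient(expression_barcodes, clinical_by_patient)
```

For the dictionary from the article (option 1 in the question), `sample_id` gives keys in the "TCGA-AR-A1AL-01" style, so the same loop would work with `sample_id` in place of `patient_id`.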
