Question: GDC portal barcode / metadata
silvia.casola8 wrote, 8 months ago:

Hi,

I'm completely new to the gdc.cancer.gov portal and I need some help (I saw there are similar questions on this forum).

What I need to do is to download Gene Expression quantification data (using HTSeq-FPKM-UQ) for breast cancer and use these data to classify cancer subtypes (luminal A, B, HER2-like, basal-like).

To retrieve the labels I basically have 2 options (feel free to add more):

1) Get the sample ID in the 'old' TCGA barcode format (e.g. "TCGA-AR-A1AL-01") and use a dictionary, which I downloaded from an older article that used the same data, that maps each barcode directly to a subtype. The problem here is that I have no idea how to get the IDs in TCGA barcode format, and it looks like the old API for doing that no longer works.

2) Also download the clinical data and check the fields linked to ER, PgR and HER2 to assign labels manually. However, once I download the expression (EXP) data, I basically lose all metadata, and I don't know how to join the two files (EXP, clinical) in order to assign labels. I know there must be a way to do what I need using the API.

Can someone with more experience with the portal help me?

Thank you :)

Tags: rna-seq, gdc, tcga
noorpratap.singh wrote, 8 months ago:

If you are familiar with R, then things are really easy. There is a package called TCGAbiolinks. For your case, examples 3 and 4 are useful. Make sure to use legacy = TRUE, since I am not sure whether the subtypes exist for the updated data.

If you are uncomfortable with R, you will have to download the two data sets (clinical and mRNA) separately, making sure that the metadata file for each is downloaded along with it. You then need to write a script that matches the barcodes from both. A barcode typically looks like TCGA-G4-6317-02A-11D-2064-05. If you split it on '-', the first three fields identify the patient, i.e. TCGA-G4-6317 is sufficient to define the patient. Extract the barcodes from both data sets and compare the first three '-'-delimited fields, as in the example above. This will let you map samples between the two data sets. Hope this helps.
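
For the script route, the matching step could look roughly like the Python sketch below. It is only an illustration under a few assumptions: that you have already built a mapping from each expression file name to its full barcode (for example from the metadata file downloaded with the data), and that your subtype dictionary is keyed by the short barcode form from the question (e.g. "TCGA-AR-A1AL-01"). The file names, the aliquot suffix on the first barcode and the subtype label are made up, and `patient_id`, `sample_id` and `label_files` are hypothetical helper names.

```python
# Sketch: attach subtype labels to expression files via TCGA barcodes.
# All file names and labels below are hypothetical placeholders.

def patient_id(barcode):
    """Return the first three '-'-delimited fields, e.g. 'TCGA-G4-6317'."""
    return "-".join(barcode.split("-")[:3])


def sample_id(barcode):
    """Return the patient ID plus the two-digit sample type, e.g. 'TCGA-G4-6317-02'."""
    fields = barcode.split("-")
    if len(fields) < 4:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    # The fourth field looks like '02A': two-digit sample type plus a vial letter.
    return "-".join(fields[:3] + [fields[3][:2]])


def label_files(file_to_barcode, subtype_by_sample):
    """Map each expression file name to a subtype label (None if unmatched)."""
    return {
        file_name: subtype_by_sample.get(sample_id(barcode))
        for file_name, barcode in file_to_barcode.items()
    }


# Assumed input 1: file name -> full barcode, built from the metadata file.
# The suffix after 'TCGA-AR-A1AL-01' is invented for illustration.
file_to_barcode = {
    "expr_a.FPKM-UQ.txt.gz": "TCGA-AR-A1AL-01A-00X-0000-00",
    "expr_b.FPKM-UQ.txt.gz": "TCGA-G4-6317-02A-11D-2064-05",
}

# Assumed input 2: short barcode -> subtype (placeholder label).
subtype_by_sample = {"TCGA-AR-A1AL-01": "Luminal A"}

labels = label_files(file_to_barcode, subtype_by_sample)
for file_name, subtype in labels.items():
    print(file_name, subtype)
```

Use `patient_id` when joining against clinical data, which is recorded per patient. Use `sample_id` when the label dictionary is keyed by the short barcode: keeping the two-digit sample type separates a patient's tumor sample (01) from a normal sample (11), which the first three fields alone cannot do.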
