Question: GDC TCGA Metadata no project ID available?
9 days ago, cindy.perscheid (Hasso Plattner Institute, Potsdam, Germany) wrote:

Hi there,

I recently downloaded data from the GDC, and to keep track of which sample belongs to which study and what sample type it has, I also created a small metadata file. However, it seems that some of the samples have no study (project_id) assigned, unless my script has some strange side effect. In one data set, only one sample of the GBM study has no project_id; in a second data set (exact same parameters for download and preprocessing), six samples have no project_id assigned. When I look up the TCGA barcodes in the data portal, I can clearly see that the samples belong to the GBM study.

Has anyone had similar experiences? I really have no clue what is going wrong here.

This is my code:

#! /usr/local/bin/Rscript
library(TCGAbiolinks)
library(dplyr)
library(SummarizedExperiment)#for colData function
library(DESeq2)

query1 <- GDCquery(project= c("TCGA-GBM"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow = "HTSeq - Counts", barcode = c(""))

#download the actual data
GDCdownload(query1)

#prepare the data for analysis and put it into a SummarizedExperiment object
data1 <- GDCprepare(query1)

#create DESeqDataSet for further processing
dds1 <- DESeqDataSet(data1, design = ~ 1)

#keep only genes with non-zero counts in more than 30% of the samples
thres <- (ncol(dds1) * 30) / 100
dds1 <- dds1[ rowSums(counts(dds1) > 0) > thres, ]

#keep only samples with non-zero counts for more than 30% of the remaining genes
thres <- (nrow(dds1) * 30) / 100
dds1 <- dds1[ ,colSums(counts(dds1) > 0) > thres]

#logtransform counts and normalize by library size (=cross-sample comparison)
transdds1 <- vst(dds1)
selected_features1 <- transdds1

#create metadata file
metadata1 <- colData(transdds1)
metadata1 <- metadata1[c("project_id", "shortLetterCode")]
transp_metadata1 <-t(as.matrix(metadata1))
write.csv(transp_metadata1, file="./GDCdata/2017-11-14_09-54-20/TCGA-GBM__GeneExpressionQuantification_HTSeq-Counts_metadata.csv")

#creates the gene expression matrix with genes as rows
matrix1 <- t(assay(transdds1))
write.csv(matrix1, file="./GDCdata/2017-11-14_09-54-20/TCGA-GBM__GeneExpressionQuantification_HTSeq-Counts.csv")
modified 8 days ago • written 9 days ago by cindy.perscheid

When you say project_id, what exactly do you mean? - TCGA-GBM? If it is this ID, then it could be that the samples without the project_id were not included in the final published manuscript for GBM.

Be sure also that these are not 'normal' samples. You can usually tell this from the TCGA barcode but, if not, go here and search for the sample with the UUID. For example, if I search for case UUID 336a3daa-0d18-4b1d-a091-b96f864d022a (TCGA barcode TCGA-AB-2904-11A), I can see that it's 'Solid Tissue Normal'.

written 9 days ago by Kevin Blighe

Hi Kevin, yes - I mean the study type when talking about project_id (this is the attribute in the colData in which it is encoded). Currently, I have 6 samples with no study type. Some of them are normal tissues (NT), some of them cancer tissues (TP). I cannot really find any pattern here. Additionally, when I downloaded the data set a week ago, I had only one sample with missing project_id (it is also missing in the recently downloaded data set) - now I have six (and the number has stayed at six since). Could it be that they have adapted the data set meanwhile (for whatever reason)? Is there any page where they release notifications about that?

These are the barcodes with missing project_id: TCGA-06-0678-11A-32R-A36H-07, TCGA-06-0680-11A-32R-A36H-07, TCGA-06-AABW-11A-31R-A36H-07, TCGA-28-2510-01A-01R-1850-01, TCGA-06-0681-11A-41R-A36H-07, TCGA-06-0675-11A-32R-A36H-07
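
A minimal sketch of how such a list of barcodes could be pulled out of the metadata. It uses a small toy data frame as a stand-in for `as.data.frame(colData(data1))`; the third barcode is hypothetical, and the column names `barcode` and `project_id` are assumed to match what the SummarizedExperiment provides.

```r
# Toy stand-in for as.data.frame(colData(data1)); values are illustrative only.
meta <- data.frame(
  barcode = c("TCGA-06-0678-11A-32R-A36H-07",
              "TCGA-28-2510-01A-01R-1850-01",
              "TCGA-XX-0000-01A-01R-0000-01"),  # hypothetical barcode
  project_id = c(NA, NA, "TCGA-GBM"),
  stringsAsFactors = FALSE
)

# Treat both NA and empty strings as "missing".
is_missing <- is.na(meta$project_id) | meta$project_id == ""

# Barcodes of all samples without a project_id.
missing_barcodes <- meta$barcode[is_missing]
print(missing_barcodes)
```

Running the same check right after `GDCprepare()` and again after the filtering steps would show whether the gaps are already present in the downloaded metadata or only appear later in the script.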

Best,

Cindy

written 9 days ago by cindy.perscheid

Hi Cindy, yes, 5 of those 6 are normals (I can immediately tell by the '11A' in the TCGA barcodes) - the other (with '01A') is a tumour, though.
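
The barcode check described above can also be done programmatically. The following is a rough sketch in base R, assuming the usual TCGA barcode layout in which the fourth hyphen-separated field starts with the two-digit sample type code (01 to 09 tumour, 10 to 19 normal, 20 to 29 control); the variable names are made up for illustration.

```r
# Two of the barcodes from this thread.
barcodes <- c("TCGA-06-0678-11A-32R-A36H-07",
              "TCGA-28-2510-01A-01R-1850-01")

# The fourth hyphen-separated field holds the sample type code plus a vial letter,
# e.g. "11A" or "01A".
sample_field <- vapply(strsplit(barcodes, "-", fixed = TRUE), `[`, character(1), 4)
sample_code <- as.integer(substr(sample_field, 1, 2))

# Map the numeric code to a coarse class following the TCGA convention.
sample_class <- ifelse(sample_code < 10, "tumour",
                ifelse(sample_code < 20, "normal", "control"))

print(data.frame(barcode = barcodes, sample_code, sample_class))
```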

Could it be that they have adapted the data set meanwhile (for whatever reason)? Is there any page where they release notifications about that?

Yes, this does happen, unfortunately, and it is highly undesirable. What's worse is that these programs link to online databases, so even the same version of a program may give different results if the developers update their database.

When I analyse TCGA data, I never use any of these programs. I download the data myself so that the data and versions are 'fixed' and don't change. For clinical data, you can obtain it from the GDC Legacy Archive. Here is a page configured for all TCGA-GBM clinical data. The patient data that you need will most likely be the one called nationwidechildrens.org_clinical_patient_gbm.txt, however, you'll have to input this and link it to your current data.
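
As a hedged illustration of the linking step: patient-level clinical tables are typically keyed by the 12-character patient barcode, which is the prefix of each full sample barcode. The sketch below uses toy data frames rather than the real file; the key column name `bcr_patient_barcode` is an assumption about the clinical file, and `some_clinical_field` with its values is purely a placeholder.

```r
# Toy stand-in for the clinical table; in practice it might be read with read.delim().
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-06-0678", "TCGA-28-2510"),  # assumed key column
  some_clinical_field = c("a", "b"),                        # placeholder values
  stringsAsFactors = FALSE
)

# Sample-level table, e.g. derived from the column names of the expression matrix.
samples <- data.frame(
  barcode = c("TCGA-06-0678-11A-32R-A36H-07",
              "TCGA-28-2510-01A-01R-1850-01"),
  stringsAsFactors = FALSE
)

# The first 12 characters of a sample barcode identify the patient.
samples$bcr_patient_barcode <- substr(samples$barcode, 1, 12)

# Left join: keep every sample, even if no clinical record is found for it.
linked <- merge(samples, clinical, by = "bcr_patient_barcode", all.x = TRUE)
print(linked)
```

Using `all.x = TRUE` keeps samples without a matching clinical record in the result with `NA` values, so missing annotations stay visible instead of being silently dropped.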

Another place where you can download data is cBioPortal

modified 9 days ago • written 9 days ago by Kevin Blighe

Since this concerns only a handful of barcodes with a missing project_id, since we can look up their assigned study, and since we have no idea why this happens, I would now just fix the issue by manually filling in the project_id. Any concerns with this?
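
One possible way to do this manual fix, shown as a sketch on a toy data frame standing in for the real metadata. The `verified` vector is a hypothetical list of barcodes that were checked by hand in the data portal; only those get the project_id filled in, so any other gap would still surface.

```r
# Toy stand-in for the metadata table.
meta <- data.frame(
  barcode = c("TCGA-06-0678-11A-32R-A36H-07",
              "TCGA-28-2510-01A-01R-1850-01"),
  project_id = c(NA, "TCGA-GBM"),
  stringsAsFactors = FALSE
)

# Barcodes confirmed by hand to belong to TCGA-GBM (hypothetical list).
verified <- c("TCGA-06-0678-11A-32R-A36H-07")

# Fill in only entries that are missing AND were verified manually.
fill <- is.na(meta$project_id) & meta$barcode %in% verified
meta$project_id[fill] <- "TCGA-GBM"

# Fail loudly if any sample is still without a project_id.
stopifnot(!anyNA(meta$project_id))
print(meta)
```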

Best,

Cindy

written 8 days ago by cindy.perscheid

Have you downloaded metadata from the GDC website itself? It is in the json format.

written 8 days ago by bharata1803

Hi bharata,

no, I have used TCGAbiolinks, which uses the GDC API for querying. When downloading, it also retrieves the metadata (the whole object is a SummarizedExperiment that contains both the expression counts and the metadata), which I can then access.

Best,

Cindy

written 8 days ago by cindy.perscheid

I see. If it comes from the GDC, maybe you can compare it with a manual download from the portal. You can then see whether the project_id exists in the original JSON metadata or not, if you are still curious.

written 8 days ago by bharata1803

This is the metadata that I downloaded manually from the GDC and then parsed with my own parser: csv file

written 8 days ago by bharata1803

I don't believe that there are any concerns. If they were removed as outliers, then you will notice that later on when checking

modified 8 days ago • written 8 days ago by Kevin Blighe