GDC TCGA metadata: no project ID available?
4.2 years ago

Hi there,

I recently downloaded data from the GDC and, to keep track of which study each sample belongs to and which sample type it has, I also created a small metadata file. However, some of the samples do not seem to have a study assigned, unless this is a strange side effect of my script. In one data set, only one sample of the GBM study has no project_id; in a second data set (exact same parameters for download and preprocessing), six samples have no project_id assigned. When I look up the TCGA barcodes in the data portal, I can clearly see that the samples belong to the GBM study.

Has anyone had similar experiences? I really have no clue what is going wrong here.

This is my code:

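#!/usr/local/bin/Rscript
library(TCGAbiolinks) #for GDCquery, GDCdownload and GDCprepare
library(dplyr)
library(SummarizedExperiment) #for colData function
library(DESeq2)

#query the GDC for the TCGA-GBM HTSeq count files
query1 <- GDCquery(project = c("TCGA-GBM"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow = "HTSeq - Counts", barcode = c(""))

#download the queried files (GDCprepare reads them from the local download directory)
GDCdownload(query1)

#prepares the data for analysis and puts it into a SummarizedExperiment object
data1 <- GDCprepare(query1)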

#create DESeqDataSet for further processing
dds1 <- DESeqDataSet(data1, design = ~ 1)

#keep only genes with non-zero counts in more than 30% of the samples
thres <- (ncol(dds1) * 30) / 100
dds1 <- dds1[ rowSums(counts(dds1) > 0) > thres, ]

#keep only samples with non-zero counts for more than 30% of the remaining genes
thres <- (nrow(dds1) * 30) / 100
dds1 <- dds1[ ,colSums(counts(dds1) > 0) > thres]

#variance-stabilizing transformation: log-like scale, normalized by library size (=cross-sample comparison)
transdds1 <- vst(dds1)
selected_features1 <- transdds1

#creates the gene expression matrix; after t() the samples are rows and the genes are columns
matrix1 <- t(assay(transdds1))
write.csv(matrix1, file="./GDCdata/2017-11-14_09-54-20/TCGA-GBM__GeneExpressionQuantification_HTSeq-Counts.csv")


When you say project_id, what exactly do you mean? TCGA-GBM? If it is this ID, then it could be that the samples without a project_id were not included in the final published manuscript for GBM.

Be sure also that these are not 'normal' samples. You can usually tell this from the TCGA barcode but, if not, go here and search for the sample with the UUID. For example, if I search for case UUID 336a3daa-0d18-4b1d-a091-b96f864d022a (TCGA barcode TCGA-AB-2904-11A), I can see that it's 'Solid Tissue Normal'.


Hi Kevin, yes, by project_id I mean the study (this is the attribute in the colData in which it is encoded). Currently, I have 6 samples with no study assigned. Some of them are normal tissues (TN), some of them cancer tissues (TP). I cannot really find any pattern here. Additionally, when I downloaded the data set a week ago, only one sample had a missing project_id (it is also missing in the recently downloaded data set); now there are six, and the number has stayed at six since. Could it be that they have adapted the data set meanwhile (for whatever reason)? Is there any page where they release notifications about that?

These are the barcodes with missing project_id: TCGA-06-0678-11A-32R-A36H-07, TCGA-06-0680-11A-32R-A36H-07, TCGA-06-AABW-11A-31R-A36H-07, TCGA-28-2510-01A-01R-1850-01, TCGA-06-0681-11A-41R-A36H-07, TCGA-06-0675-11A-32R-A36H-07

Best,

Cindy


Hi Cindy, yes, 5 of those 6 are normals (I can immediately tell by the '11A' in the TCGA barcodes) - the other (with '01A') is a tumour, though.
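
For anyone who wants to do this check in code rather than by eye, here is a minimal sketch in base R. It assumes the standard TCGA barcode layout, in which characters 14-15 hold the two-digit sample type code, and it only maps the two codes that come up in this thread ('01' and '11'); the variable names are made up for illustration.

```r
# Barcodes reported in this thread as having no project_id
barcodes <- c(
  "TCGA-06-0678-11A-32R-A36H-07",
  "TCGA-06-0680-11A-32R-A36H-07",
  "TCGA-06-AABW-11A-31R-A36H-07",
  "TCGA-28-2510-01A-01R-1850-01",
  "TCGA-06-0681-11A-41R-A36H-07",
  "TCGA-06-0675-11A-32R-A36H-07"
)

# Characters 14-15 of a TCGA barcode hold the sample type code
sample_code <- substr(barcodes, 14, 15)

# Only the two codes seen in this thread are mapped here;
# any other code would come out as NA
code_labels <- c("01" = "Primary Solid Tumor", "11" = "Solid Tissue Normal")

sample_info <- data.frame(
  barcode = barcodes,
  sample_code = sample_code,
  sample_type = unname(code_labels[sample_code]),
  stringsAsFactors = FALSE
)

# Count how many samples fall into each sample type
print(table(sample_info$sample_type))
```

The same `substr()` call should work on the barcode column of the colData, assuming that column holds full aliquot barcodes like the ones above.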

Could it be that they have adapted the data set meanwhile (for whatever reason)? Is there any page where they release notifications about that?

Yes, this does happen, unfortunately, and it is highly undesirable. What's worse is that these programs link to online databases, so even the same version of a program may give different results if the developers update their database.

When I analyse TCGA data, I never use any of these programs. I download the data myself so that the data and versions are 'fixed' and don't change. For clinical data, you can obtain it from the GDC Legacy Archive. Here is a page configured for all TCGA-GBM clinical data. The patient data that you need will most likely be the file called nationwidechildrens.org_clinical_patient_gbm.txt; however, you'll have to read it in and link it to your current data yourself.
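
The linking step could look roughly like the sketch below. It uses toy data frames instead of the real files, and it assumes that the clinical table has a patient-level barcode column (called `bcr_patient_barcode` here, which is an assumption about the file layout) and that the first 12 characters of a sample barcode identify the patient. The `example_field` column and its values are invented purely for illustration.

```r
# Toy stand-in for the sample metadata, e.g. as.data.frame(colData(data1))
samples <- data.frame(
  barcode = c("TCGA-06-0678-11A-32R-A36H-07", "TCGA-28-2510-01A-01R-1850-01"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the clinical patient table. The real file would be read
# with something like read.delim(); the column name bcr_patient_barcode is an
# assumption, and example_field with its values is made up.
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-06-0678", "TCGA-28-2510"),
  example_field = c("value_a", "value_b"),
  stringsAsFactors = FALSE
)

# The first 12 characters of a sample barcode give the patient-level barcode
samples$patient_barcode <- substr(samples$barcode, 1, 12)

# Left join: keep every sample, attach clinical fields where a patient matches
linked <- merge(
  samples, clinical,
  by.x = "patient_barcode", by.y = "bcr_patient_barcode",
  all.x = TRUE
)
print(linked)
```

Using `all.x = TRUE` keeps samples without a clinical match in the result (with NA in the clinical columns), so nothing silently drops out of the metadata.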


Since this only concerns a handful of barcodes with a missing project_id, since we can look up the study they are assigned to, and since we have no idea why this happens, I would now just fix the issue by manually filling in the project_id. Any concerns with this?
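
A minimal sketch of what such a manual fix could look like, using a toy data frame in place of the real metadata. It assumes the metadata has `barcode` and `project_id` columns; the third row and the `project_id_filled_manually` flag column are hypothetical and only there for illustration.

```r
# Toy stand-in for the metadata table, e.g. as.data.frame(colData(data1));
# the third row is a hypothetical sample that already has a project_id
meta <- data.frame(
  barcode = c("TCGA-06-0678-11A-32R-A36H-07",
              "TCGA-28-2510-01A-01R-1850-01",
              "HYPOTHETICAL-SAMPLE-WITH-ID"),
  project_id = c(NA, NA, "TCGA-GBM"),
  stringsAsFactors = FALSE
)

# Treat both NA and empty strings as a missing project_id
missing <- is.na(meta$project_id) | meta$project_id == ""
print(meta$barcode[missing])

# Keep a record of which rows were patched by hand, then fill them in
meta$project_id_filled_manually <- missing
meta$project_id[missing] <- "TCGA-GBM"
print(meta)
```

Keeping the flag column means the patched samples can still be identified later, for example if they turn out to behave like outliers.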

Best,

Cindy


Hi bharata,

No, I used TCGAbiolinks, which uses the GDC API for querying. Along with the data, it also retrieves the metadata, which I can then access (the whole object is a SummarizedExperiment that contains both the expression counts and the metadata).

Best,

Cindy


I see. If it is from the GDC, maybe you can compare it with what you get when you download manually from the portal. That way you can see whether the project_id exists in the original JSON or not, if you are still curious.


This is the metadata that I downloaded manually from the GDC and then parsed with my own parser: csv file


I don't believe that there are any concerns. If they were removed as outliers, then you will notice that later on when checking