I recently downloaded data from the GDC, and to keep track of which sample belongs to what study and have which sample type, I also created a small metadata file. However, it seems that some of the samples do not have a study type assigned - or I must have a strange side effect in my script. I have one data set where only one sample of the GBM study has no project_id, and a second data set (exact same parameters for download and preprocessing) where six samples have no project_id assigned. When looking up the TCGA barcode in the data portal, I can clearly see that the sample belongs to the GBM study.
Has anyone had similar experiences? I really have no clue what is going wrong here.
This is my code:
#! /usr/local/bin/Rscript library(TCGAbiolinks) library(dplyr) library(SummarizedExperiment)#for colData function library(DESeq2) query1 <- GDCquery(project= c("TCGA-GBM"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow = "HTSeq - Counts", barcode = c("")) #download the actual data GDCdownload(query1) #prepares the data for analysis and putting it into an SummarizedExperiment object data1 <- GDCprepare(query1) #create DESeqDataSet for further processing dds1 <- DESeqDataSet(data1, design = ~ 1) #filter out genes with >30% missing data thres <- (ncol(dds1) * 30) / 100 dds1 <- dds1[ rowSums(counts(dds1) > 0) > thres, ] #filter out samples with >30% missing data thres <- (nrow(dds1) * 30) / 100 dds1 <- dds1[ ,colSums(counts(dds1) > 0) > thres] #logtransform counts and normalize by library size (=cross-sample comparison) transdds1 <- vst(dds1) selected_features1 <- transdds1 #create metadata file metadata1 <- colData(transdds1) metadata1 <- metadata1[c("project_id", "shortLetterCode")] transp_metadata1 <-t(as.matrix(metadata1)) write.csv(transp_metadata1, file="./GDCdata/2017-11-14_09-54-20/TCGA-GBM__GeneExpressionQuantification_HTSeq-Counts_metadata.csv") #creates the gene expression matrix with genes as rows matrix1 <- t(assay(transdds1)) write.csv(matrix1, file="./GDCdata/2017-11-14_09-54-20/TCGA-GBM__GeneExpressionQuantification_HTSeq-Counts.csv")