Hello,
I am trying to intersect / match data available for the TCGA-BRCA project (open access only). To be more precise:
1. I looked for mutations in a particular gene using the GDC web portal. This gave me (simplified):
Case_ID Project
TCGA-BH-A2L8 TCGA-BRCA
TCGA-AR-A1AO TCGA-BRCA
TCGA-A2-A1FZ TCGA-BRCA
2. I extracted the Case_ID values and used them in R with TCGAbiolinks:
library(TCGAbiolinks)
case_id_list <- c("TCGA-BH-A2L8", "TCGA-AR-A1AO")
query_expression <- GDCquery(project = "TCGA-BRCA",
                             data.category = "Transcriptome Profiling",
                             data.type = "Gene Expression Quantification",
                             barcode = case_id_list,
                             experimental.strategy = "RNA-Seq",
                             sample.type = c("Primary Tumor", "Solid Tissue Normal"))
GDCdownload(query_expression)
The real case_id_list was longer, and I got a bunch of folders in .//GDCdata/TCGA-BRCA/Transcriptome_Profiling/Gene_Expression_Quantification/ whose names correspond to file_id, such as 0877bc64-fbf4-427f-a889-21a4a9102600
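To make the layout above concrete, here is a minimal sketch that turns the download folder into a table of file_id and TSV path. It assumes each file_id folder holds its expression file with a .tsv extension; the function name list_downloaded_tsv is just an illustrative helper, not part of TCGAbiolinks.

```r
# Hypothetical helper: scan a download root laid out as <root>/<file_id>/<name>.tsv
# and return one row per TSV, using the parent folder name as the file_id.
list_downloaded_tsv <- function(root) {
  tsv_paths <- list.files(root, pattern = "\\.tsv$", recursive = TRUE,
                          full.names = TRUE)
  data.frame(
    file_id  = basename(dirname(tsv_paths)),  # folder name = GDC file_id
    tsv_path = tsv_paths,
    stringsAsFactors = FALSE
  )
}
```

Calling it on the Gene_Expression_Quantification folder should give one row per downloaded file, which can later be joined to case-level metadata by file_id.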
My problems:
A) Given a TSV file ID, how do I find the case_ID / barcode (or whatever identifier) that tells me which case the TSV file belongs to? and
B) While I can redo the download of the TSV expression files, how do I find out whether the data come from "Primary Tumor" or "Solid Tissue Normal"?
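For point B, one possible route (a sketch, not a confirmed solution) relies on the TCGA barcode convention: in a sample-level barcode the fourth dash-separated field starts with a two-digit sample type code, where 01 is a primary tumor and 11 is solid tissue normal. This only helps once a sample-level barcode is known for each file; the 12-character case IDs above do not carry that field. The helper names are hypothetical and the lookup covers only the two sample types I asked for.

```r
# Hypothetical helpers based on the TCGA barcode layout
# TCGA-<TSS>-<participant>-<sample code + vial>-...
case_id_from_barcode <- function(barcode) {
  substr(barcode, 1, 12)  # e.g. "TCGA-BH-A2L8"
}

sample_type_from_barcode <- function(barcode) {
  fields <- strsplit(barcode, "-", fixed = TRUE)
  # First two characters of the 4th field; NA when the barcode is case-level only
  code <- vapply(fields, function(f) {
    if (length(f) >= 4) substr(f[4], 1, 2) else NA_character_
  }, character(1))
  lookup <- c("01" = "Primary Tumor", "11" = "Solid Tissue Normal")
  unname(lookup[code])  # codes outside the lookup come back as NA
}
```

Any barcode that is too short or uses a code outside the lookup yields NA, so unexpected sample types are easy to spot rather than being silently mislabeled.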
I have tried:
library("GenomicDataCommons")
ge_manifest <- files() %>%
filter( cases.project.project_id == 'TCGA-BRCA') %>%
filter( type == 'gene_expression' ) %>%
filter( analysis.workflow_type == 'STAR - Counts') %>%
filter( access == 'open') %>%
filter( file_id == '0877bc64-fbf4-427f-a889-21a4a9102600') %>%
manifest()
head(ge_manifest)
But I do not see any column resembling the values I need for points A or B.
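Since the manifest does not show case or sample columns, another option is to ask the GDC REST API files endpoint for those fields explicitly. The sketch below only builds the request URL (no network call), using the documented filters, fields, format and size query parameters; the helper name build_gdc_files_url and the default field list are my own choices, and I have not verified the exact column names of the returned TSV.

```r
# Hypothetical helper: build a GDC /files query URL that asks for case and
# sample fields for a set of file UUIDs. Nothing is downloaded here.
build_gdc_files_url <- function(file_ids,
                                fields = c("file_id",
                                           "cases.submitter_id",
                                           "cases.samples.sample_type")) {
  quoted_ids <- paste0('"', file_ids, '"', collapse = ",")
  # GDC filter syntax: {"op": "in", "content": {"field": ..., "value": [...]}}
  filters <- sprintf(
    '{"op":"in","content":{"field":"file_id","value":[%s]}}', quoted_ids)
  paste0("https://api.gdc.cancer.gov/files",
         "?filters=", utils::URLencode(filters, reserved = TRUE),
         "&fields=", paste(fields, collapse = ","),
         "&format=TSV",
         "&size=", length(file_ids))  # one result row expected per file
}
```

The resulting URL can be passed to curl, or read directly with something like read.delim(build_gdc_files_url(my_file_ids)), after which the returned columns need to be inspected by hand.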
EDIT: Getting a bit closer:
files_ids <- c("0877bc64-fbf4-427f-a889-21a4a9102600",
"08837ae7-6f4f-4aa1-8722-7c404b66ed75")
case_ids <- cases() %>%
filter(~ project.project_id == "TCGA-BRCA") %>%
filter( files.file_id %in% files_ids) %>%
ids()
# case_ids contains case UUIDs: '30ec8b1f-28c4-4f46-8a1b-a8d51e558c7d', '87b85935-a058-44ad-8fb6-8511130eaffe'
I have improved the tags as suggested. Although I posted code using two particular R libraries, I do not mind if the solution uses the GDC Python API or even curl. To make sense of the TCGA-BRCA expression data, i.e. starting from p53 mutations, I need as a minimum: patient_id, mutation_type, mutation_site, expression_tsv_file. While one can download a few multi-column TSVs from the GDC website and work with those, I would prefer to have this as reproducible code. I am aware that some data (I checked drug therapy as a proxy for guessing breast cancer subtypes) are not properly curated (misspelled names, compound vs. commercial drug names, etc.), so some custom coding will be required.
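The target table described above could be assembled with a plain merge once both pieces exist. This is a minimal sketch with placeholder values only: the mutation entries and file paths below are dummies, not real GDC data, and the column names of file_map (case_id, sample_type, expression_tsv_file) are assumptions about how the file-to-case mapping would be stored.

```r
# Hypothetical helper: join per-case mutation rows to per-file expression rows.
# all.x = TRUE keeps mutated cases even when no expression file was found.
build_case_table <- function(mutations, file_map) {
  merge(mutations, file_map,
        by.x = "patient_id", by.y = "case_id", all.x = TRUE)
}

# Placeholder inputs, for illustration only
mutations <- data.frame(
  patient_id    = c("TCGA-BH-A2L8", "TCGA-AR-A1AO"),
  mutation_type = c("<type placeholder>", "<type placeholder>"),
  mutation_site = c("<site placeholder>", "<site placeholder>"),
  stringsAsFactors = FALSE
)
file_map <- data.frame(
  case_id             = "TCGA-BH-A2L8",
  sample_type         = "Primary Tumor",
  expression_tsv_file = "<path placeholder>.tsv",
  stringsAsFactors = FALSE
)
case_table <- build_case_table(mutations, file_map)
```

A case with both tumor and normal files would appear as two rows, one per expression file, which seems the natural shape for downstream comparisons.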