4.2 years ago
Matina ▴ 230

Hi all,

I have downloaded the TCGA-BRCA RNA-seq data and the associated clinical information using the code below.

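library(TCGAbiolinks)

CancerProject <- "TCGA-BRCA"

# Query all HTSeq count files for the project
query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")

# Barcodes of every sample returned by the query
samplesDown <- getResults(query, cols = c("cases"))

# Keep only primary solid tumor (TP) samples
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

# Repeat the query, restricted to the TP barcodes
queryDown <- GDCquery(project = CancerProject,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      barcode = dataSmTP)

# GDCprepare reads files that are already on disk, so download them first
GDCdownload(query = queryDown,
            directory = "BRC_RESULTS/TCGA/htseq_data/")

dataPrep <- GDCprepare(query = queryDown,
                       save = TRUE,
                       directory = "BRC_RESULTS/TCGA/htseq_data/",
                       save.filename = "htseq_counts.rda",
                       summarizedExperiment = TRUE)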


In the clinical data there are several columns such as days_to_death or days_to_last_follow_up and other columns such as subtype_OS.Time or subtype_OS.event.

What is the difference between the columns that have subtype_ at the beginning and the rest, and which ones should I use for survival analysis? At the moment I have used the subtype_ columns for my analysis and I am wondering if this is correct.

Thanks a lot,

Matina

Dear Matina,

What is your purpose with the RNA-Seq data? DE analysis? Inspecting the expression of specific genes, for example? Or looking for molecular subtype patterns and survival analysis? I think you already got an answer from one of the creators of the R package on GitHub, correct?

Best,

Efstathios

Hi Efstathios,

I have a set of genes that I am interested in and I want to see if they are associated with clinical outcomes and molecular subtype patterns. You are right, I got an answer on GitHub.

4.2 years ago
atakanekiz ▴ 300

Hi Matina,

For survival analyses I would go with days_to_death, and with days_to_last_follow_up for patients who are still alive. I think the columns that start with subtype_ might be manually curated data. I'm not 100% sure, but subtype_OS.Time sounds like the time period during which the tumor was classified as a certain subtype (primary, metastatic, stage I-II-III, etc.). I think days_to_death is a more straightforward data type.
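The suggestion above, days_to_death for deceased patients and days_to_last_follow_up for everyone else, could be sketched roughly as follows. This is only a minimal illustration in base R: the toy data frame, its values, and the new column names os_time and os_event are made up, and I am assuming the clinical table has a vital_status column with the values "Dead" and "Alive".

```r
# Toy stand-in for the clinical table; all values are invented for illustration.
clin <- data.frame(
  barcode = c("P1", "P2", "P3", "P4"),
  vital_status = c("Dead", "Alive", "Alive", "Dead"),
  days_to_death = c(420, NA, NA, 1310),
  days_to_last_follow_up = c(NA, 900, 2150, NA),
  stringsAsFactors = FALSE
)

# Event indicator: 1 = death observed, 0 = censored at last follow-up.
clin$os_event <- as.integer(clin$vital_status == "Dead")

# Time: days_to_death for deceased patients, days_to_last_follow_up otherwise.
clin$os_time <- ifelse(clin$os_event == 1,
                       clin$days_to_death,
                       clin$days_to_last_follow_up)

# Drop patients whose time is missing or not positive before modelling.
clin_os <- clin[!is.na(clin$os_time) & clin$os_time > 0, ]

print(clin_os[, c("barcode", "os_time", "os_event")])
```

The resulting os_time and os_event columns are in the shape that survival::Surv(time, event) expects, if you go on to fit Kaplan-Meier curves or a Cox model.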

Atakan

Hi Atakan,

This is correct. I got an answer from one of the developers of TCGAbiolinks on GitHub saying that everything that starts with subtype_ is actually metadata from the papers that analyzed the samples, and suggesting that I use days_to_death. In any case, what is strange is that the subtype_ column for OS has clinical information for patients whose days_to_last_follow_up is missing, or it reports a completely different number of days.
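To see how often the two sources disagree, one option is to tabulate them side by side. The snippet below is just a sketch in base R on invented values: the toy data frame and the status labels are hypothetical, and only the column names days_to_death, days_to_last_follow_up and subtype_OS.Time come from the clinical data.

```r
# Toy stand-in for the clinical table; all values are invented for illustration.
clin <- data.frame(
  barcode = c("P1", "P2", "P3", "P4"),
  days_to_death = c(420, NA, NA, 1310),
  days_to_last_follow_up = c(NA, 900, NA, NA),
  subtype_OS.Time = c(420, 1200, 2150, NA),
  stringsAsFactors = FALSE
)

# Single time value from the days_to_ columns: death if recorded, else follow-up.
days_time <- ifelse(is.na(clin$days_to_death),
                    clin$days_to_last_follow_up,
                    clin$days_to_death)
sub_time <- clin$subtype_OS.Time

# Classify each patient by how the two sources relate.
clin$status <- ifelse(is.na(days_time) & is.na(sub_time), "both_missing",
               ifelse(is.na(days_time), "only_subtype",
               ifelse(is.na(sub_time), "only_days_to",
               ifelse(days_time == sub_time, "agree", "differ"))))

# Count how many patients fall into each category.
print(table(clin$status))
```

Patients flagged "only_subtype" or "differ" are the ones worth checking by hand before deciding which columns to trust.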

Thanks again, Matina

4.2 years ago
igor 12k

You could also consider using the Pan-Cancer Atlas curated survival data from Xena:

Thank you very much Igor! I will have a look at this!