Question: Discrepancies in TCGAbiolinks subtype data
3 days ago, atakanekiz (190) wrote:

Hello,

I have a question about accessing the subtype information associated with TCGA projects using the TCGAbiolinks package. The example here is COAD, but the question applies to other projects as well, for instance SKCM.

When I download the RNAseq experiment as a SummarizedExperiment object I can access the metadata associated with the samples by calling colData(coad). In this data frame, there is information regarding MSI (microsatellite instability) status of tumors. The information I get from there is the following:

# The coad object was prepared beforehand with GDCdownload() and GDCprepare()
library(SummarizedExperiment)  # provides colData()

meta <- as.data.frame(colData(coad))

dim(meta)
#>[1] 521 102

summary(meta$subtype_MSI_status)
#>                      MSI-H         MSI-L           MSS Not Evaluable          NA's 
#>            0            40            42           126             0           313

Alternatively, I can also download subtype information using TCGAquery_subtype function. When I do that and look at the MSI data in the downloaded data frame, this is what I see:

subtype <- TCGAbiolinks::TCGAquery_subtype("COAD")

dim(subtype)
#>[1] 276  45

summary(subtype$MSI_status)
#>                      MSI-H         MSI-L           MSS Not Evaluable 
#>           0            38            44           193             1
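The two summaries above are not directly comparable as raw counts: the SummarizedExperiment metadata has one row per sample (521 rows), while the subtype table has 276 rows. One way to see where the annotations actually disagree is to line the tables up patient by patient. The following is only a minimal sketch: `compare_status()` is a hypothetical helper, it assumes both data frames carry a patient barcode column named `patient`, and the toy data frames are made up for illustration rather than taken from TCGA.

```r
# Hypothetical helper: match two tables on a shared patient identifier
# and report whether a status column agrees between them.
compare_status <- function(a, b, id = "patient", col_a, col_b) {
  # Keep one row per patient in each table. Taking the first row assumes
  # the status is identical across all samples of the same patient.
  a <- a[!duplicated(a[[id]]), c(id, col_a)]
  b <- b[!duplicated(b[[id]]), c(id, col_b)]
  names(a) <- c("id", "status_a")
  names(b) <- c("id", "status_b")
  # Full outer join so patients present in only one table are retained
  m <- merge(a, b, by = "id", all = TRUE)
  # TRUE only when both values are present and identical
  m$agree <- !is.na(m$status_a) & !is.na(m$status_b) &
    as.character(m$status_a) == as.character(m$status_b)
  m
}

# Toy data (made up, not real TCGA values)
toy_se  <- data.frame(patient = c("P1", "P1", "P2", "P3"),
                      subtype_MSI_status = c("MSI-H", "MSI-H", "MSS", NA))
toy_sub <- data.frame(patient = c("P1", "P2", "P4"),
                      MSI_status = c("MSI-H", "MSI-L", "MSS"))

cmp <- compare_status(toy_se, toy_sub,
                      col_a = "subtype_MSI_status", col_b = "MSI_status")

# Cross-tabulate the two annotations, keeping missing values visible
print(table(cmp$status_a, cmp$status_b, useNA = "ifany"))
```

With the objects from the post, the call might look like `compare_status(meta, subtype, col_a = "subtype_MSI_status", col_b = "MSI_status")`, provided both tables really do have a `patient` column. Rows with `agree == FALSE` are the patients worth inspecting by hand.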

A similar discrepancy appears when comparing survival times between the SummarizedExperiment and TCGAquery_subtype data frames. For some patients, one has a shorter follow-up time than the other (i.e. the patient is censored at an early date with an alive vital_status in one data frame, whereas he/she appears deceased at a later time point in the other).
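The survival mismatch can be listed in the same patient-by-patient way. The sketch below is an illustration under stated assumptions, not the package's own method: `standardize()` and `followup_diff()` are hypothetical helpers, the column names `vital_status`, `days_to_death` and `days_to_last_follow_up` are assumed and may differ in the subtype table, and the toy values are made up.

```r
# Hypothetical helper: reduce a clinical table to id, death flag and one
# follow-up time per patient (days to death if dead, else last follow-up).
# A missing vital status is treated as "not dead" here.
standardize <- function(df, id, status, death, last) {
  dead <- tolower(as.character(df[[status]])) %in% c("dead", "deceased")
  out <- data.frame(id = as.character(df[[id]]),
                    dead = dead,
                    time = ifelse(dead, df[[death]], df[[last]]))
  out[!duplicated(out$id), ]
}

# Hypothetical helper: patients found in both tables whose vital status
# or follow-up time differs.
followup_diff <- function(a, b) {
  m <- merge(a, b, by = "id", suffixes = c("_a", "_b"))
  time_differs <- !is.na(m$time_a) & !is.na(m$time_b) & m$time_a != m$time_b
  m[m$dead_a != m$dead_b | time_differs, ]
}

# Toy data (made up): P1 is censored alive in one table, deceased in the other
toy_a <- data.frame(patient = c("P1", "P2", "P3"),
                    vital_status = c("Alive", "Dead", "Alive"),
                    days_to_death = c(NA, 400, NA),
                    days_to_last_follow_up = c(200, NA, 350))
toy_b <- data.frame(patient = c("P1", "P2", "P3"),
                    vital_status = c("Dead", "Dead", "Alive"),
                    days_to_death = c(610, 400, NA),
                    days_to_last_follow_up = c(NA, NA, 350))

std_a <- standardize(toy_a, "patient", "vital_status",
                     "days_to_death", "days_to_last_follow_up")
std_b <- standardize(toy_b, "patient", "vital_status",
                     "days_to_death", "days_to_last_follow_up")

diffs <- followup_diff(std_a, std_b)
print(diffs)
```

Passing the column names as arguments keeps the helper usable when the two tables name their survival fields differently, which should be checked with `colnames()` before running it on real data.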

What is the reason for the discrepancy between the two sets of subtype data? I remember having similar issues with SKCM (for both subtype and survival data). I would appreciate it if you could let me know which version is the more accurate one to use.

Best, Atakan

3 days ago, Kevin Blighe (54k) wrote:

There are discrepancies 'everywhere' in the TCGA data because the primary data has been taken and re-analysed by countless third parties. The primary data itself is also constantly evolving. From where, exactly, did you obtain the SummarizedExperiment data?


I prepared the SE object with TCGAbiolinks, using GDCquery, GDCdownload and GDCprepare as follows:

query_harmonized_gene <- GDCquery(project = i,
                                    data.category = "Transcriptome Profiling",
                                    data.type = "Gene Expression Quantification",
                                    workflow.type = "HTSeq - Counts",
                                    legacy = FALSE)

GDCdownload(query_harmonized_gene, method="client", files.per.chunk = 10, directory = "./")

rnaseq_data <- GDCprepare(query_harmonized_gene, directory = "./")
Reply written 3 days ago by atakanekiz (190)

I see, thank you. Unfortunately, one could spend an entire day trying to figure out the cause of the discrepancy [and I have done this in the past]. To avoid that, I would encourage you to simply pick one of the two functions and document which one you used. The developers of TCGAbiolinks are not very active here or on Bioconductor, so you could be waiting some time for a response. You could, though, post this as an issue on the package's GitHub page.

Reply written 3 days ago by Kevin Blighe (54k)

Thanks for the insight. That has been my approach so far. Looking in both places seems helpful for getting the best out of TCGA data.

Reply written 2 days ago by atakanekiz (190)