Question

how to get ER, PR and HER2 data from TCGA BRCA

3.8 years ago

StartR ▴ 30

Hi, I have downloaded the BRCA data from TCGA using TCGAbiolinks.

I have done this:

BRCARnaseqSE <- GDCprepare(query.a, directory = "BRCA_all")
sample.info <- SummarizedExperiment::colData(BRCARnaseqSE)

Now I want to get the ER, PR and HER2 status (positive or negative) for each sample, but I cannot find any such columns. Here are the column names of sample.info:

names(sample.info)
 [1] "sample"                                      "patient"                                     "barcode"                                    
 [4] "shortLetterCode"                             "definition"                                  "days_to_recurrence"                         
 [7] "ajcc_staging_system_edition"                 "days_to_last_follow_up"                      "classification_of_tumor"                    
[10] "age_at_diagnosis"                            "icd_10_code"                                 "prior_malignancy"                           
[13] "morphology"                                  "created_datetime.x"                          "last_known_disease_status"                  
[16] "tumor_stage"                                 "updated_datetime.x"                          "days_to_last_known_disease_status"          
[19] "ajcc_pathologic_t"                           "treatments"                                  "year_of_diagnosis"                          
[22] "synchronous_malignancy"                      "state.x"                                     "ajcc_pathologic_m"                          
[25] "progression_or_recurrence"                   "prior_treatment"                             "site_of_resection_or_biopsy"                
[28] "ajcc_pathologic_n"                           "days_to_diagnosis"                           "tissue_or_organ_of_origin"                  
[31] "diagnosis_id"                                "tumor_grade"                                 "primary_diagnosis"                          
[34] "ajcc_pathologic_stage"                       "created_datetime.y"                          "cigarettes_per_day"                         
[37] "state.y"                                     "bmi"                                         "weight"                                     
[40] "exposure_id"                                 "height"                                      "alcohol_intensity"                          
[43] "alcohol_history"                             "updated_datetime.y"                          "years_smoked"                               
[46] "gender"                                      "created_datetime"                            "days_to_birth"                              
[49] "state"                                       "race"                                        "ethnicity"                                  
[52] "demographic_id"                              "year_of_birth"                               "vital_status"                               
[55] "age_at_index"                                "year_of_death"                               "updated_datetime"                           
[58] "days_to_death"                               "bcr_patient_barcode"                         "project_id"                                 
[61] "disease_type"                                "dbgap_accession_number"                      "name"                                       
[64] "released"                                    "releasable"                                  "primary_site"                               
[67] "is_ffpe"                                     "subtype_patient"                             "subtype_Tumor.Type"                         
[70] "subtype_Included_in_previous_marker_papers"  "subtype_vital_status"                        "subtype_days_to_birth"                      
[73] "subtype_days_to_death"                       "subtype_days_to_last_followup"               "subtype_age_at_initial_pathologic_diagnosis"
[76] "subtype_pathologic_stage"                    "subtype_Tumor_Grade"                         "subtype_BRCA_Pathology"                     
[79] "subtype_BRCA_Subtype_PAM50"                  "subtype_MSI_status"                          "subtype_HPV_Status"                         
[82] "subtype_tobacco_smoking_history"             "subtype_CNV.Clusters"                        "subtype_Mutation.Clusters"                  
[85] "subtype_DNA.Methylation.Clusters"            "subtype_mRNA.Clusters"                       "subtype_miRNA.Clusters"                     
[88] "subtype_lncRNA.Clusters"                     "subtype_Protein.Clusters"                    "subtype_PARADIGM.Clusters"                  
[91] "subtype_Pan.Gyn.Clusters"

I cannot see any column related to ER status, such as er_status_by_ihc, pr_status_by_ihc or her2_status_by_ihc.

Please help!

Thanks!

BRCA TCGA • 2.2k views

ADD COMMENT • link updated 5 months ago by Ram 43k • written 3.8 years ago by StartR ▴ 30

score 1 · Answer 1 · 2020-06-27

3.8 years ago

Kevin Blighe 87k

Not sure about TCGAbiolinks, but the information is definitely available at the GDC Data Portal; see this answer: "How to download triple negative breast cancer RNA-seq fpkm data from GDC".

You can feasibly use that information and link it up to your TCGAbiolinks output.
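A minimal sketch of that linking step, using base R and toy data only. The barcodes below are made up, and the clinical column names (bcr_patient_barcode, er_status_by_ihc, pr_status_by_ihc, her2_status_by_ihc) are assumed to match the ones mentioned in the question; the idea is that the first 12 characters of a TCGA sample barcode identify the patient.

```r
# Toy stand-in for as.data.frame(colData(BRCARnaseqSE)); barcodes are invented.
sample.info <- data.frame(
  barcode = c("TCGA-AA-0001-01A-11R-A000-07",
              "TCGA-AA-0002-01A-11R-A000-07",
              "TCGA-AA-0003-11A-11R-A000-07"),
  stringsAsFactors = FALSE
)

# The patient ID is the first 12 characters of the sample barcode.
sample.info$patient <- substr(sample.info$barcode, 1, 12)

# Toy stand-in for a clinical table with receptor status (assumed column names).
clin <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0002"),
  er_status_by_ihc    = c("Positive", "Negative"),
  pr_status_by_ihc    = c("Positive", "Negative"),
  her2_status_by_ihc  = c("Negative", "Negative"),
  stringsAsFactors = FALSE
)

# match() keeps the sample order; patients missing from clin get NA.
idx <- match(sample.info$patient, clin$bcr_patient_barcode)
receptor_cols <- c("er_status_by_ihc", "pr_status_by_ihc", "her2_status_by_ihc")
sample.info[receptor_cols] <- clin[idx, receptor_cols]

print(sample.info[, c("patient", receptor_cols)])
```

Using match() rather than merge() keeps the rows in the same order as the expression matrix columns, so the new columns can be written back to the SummarizedExperiment object afterwards if needed.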

Kevin

score 0 · Answer 2 · 2023-11-20

I came across this post because I had the same question. Here's the way I did it 3 years ago (saved in an old code file) and tested today (20-Nov-2023):

library(tidyverse)
library(TCGAbiolinks)

# Query and download the BRCA clinical supplement in BCR Biotab format
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Clinical",
                  data.type = "Clinical Supplement",
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.all <- GDCprepare(query)

# Patient-level table; it holds the *_status_by_ihc columns
tcga_brca.clin <- clinical.all$clinical_patient_brca

# Triple negative: ER, PR and HER2 all negative by IHC
tcga_brca.tnbc_samples <- tcga_brca.clin %>%
    filter(er_status_by_ihc == 'Negative' &
           pr_status_by_ihc == 'Negative' &
           her2_status_by_ihc == 'Negative') %>%
    pull(bcr_patient_barcode)

# ER positive and not HER2 positive (any HER2 value other than 'Positive' is kept)
tcga_brca.er_samples <- tcga_brca.clin %>%
    filter(er_status_by_ihc == 'Positive' &
           her2_status_by_ihc != 'Positive') %>%
    pull(bcr_patient_barcode)

# HER2 positive by IHC
tcga_brca.her2_samples <- tcga_brca.clin %>%
    filter(her2_status_by_ihc == 'Positive') %>%
    pull(bcr_patient_barcode)
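Before trusting filters like these, it may be worth tabulating the status values first. Here is a small sketch on toy data. The extra header-like rows and the labels "Equivocal" and "[Not Evaluated]" are assumptions about what the Biotab table can contain, so check them against your own download.

```r
# Toy clinical table; the first two rows mimic extra header rows (assumed layout).
clin <- data.frame(
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:",
                          "TCGA-AA-0001", "TCGA-AA-0002",
                          "TCGA-AA-0003", "TCGA-AA-0004"),
  er_status_by_ihc    = c("er_status_by_ihc", "CDE_ID:",
                          "Positive", "Positive", "Positive", "Negative"),
  her2_status_by_ihc  = c("her2_status_by_ihc", "CDE_ID:",
                          "Negative", "Equivocal", "[Not Evaluated]", "Positive"),
  stringsAsFactors = FALSE
)

# Keep only rows that look like real patients.
clin <- clin[grepl("^TCGA-", clin$bcr_patient_barcode), ]

# See which labels actually occur before filtering on them.
print(table(clin$her2_status_by_ihc, useNA = "ifany"))

# '!= "Positive"' also keeps equivocal and unevaluated cases ...
loose <- clin$bcr_patient_barcode[clin$er_status_by_ihc == "Positive" &
                                  clin$her2_status_by_ihc != "Positive"]
# ... whereas '== "Negative"' keeps only confirmed HER2-negative cases.
strict <- clin$bcr_patient_barcode[clin$er_status_by_ihc == "Positive" &
                                   clin$her2_status_by_ihc == "Negative"]

print(loose)
print(strict)
```

Whether the loose or the strict definition is appropriate depends on the analysis; the point is only that the two can give different sample lists.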

The BCR Biotab seems to give info on only a very limited number of samples. The BCR XML data.format has info on a lot more samples, but I cannot find a function that parses it. Even the GDCprepare_clinic function seems to work on a rather limited subset of XML fields. I'm writing my own hack and will update as soon as it's done.

I think I was wrong: both the Biotab and XML formats give us the same data, just in a different number of files. I ran a preliminary test: the 116 TNBC patient IDs (TCGA-XX-XXXX) overlap 100% between the two formats.
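For anyone who wants to repeat that kind of check, a rough sketch with base R set operations is below. The two ID vectors are toy placeholders (hypothetical names ids_biotab and ids_xml); in practice they would be the patient barcodes pulled from each format.

```r
# Toy placeholders for the patient IDs obtained from each format.
ids_biotab <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003")
ids_xml    <- c("TCGA-AA-0003", "TCGA-AA-0001", "TCGA-AA-0002")

shared      <- intersect(ids_biotab, ids_xml)  # IDs present in both
only_biotab <- setdiff(ids_biotab, ids_xml)    # IDs missing from the XML set
only_xml    <- setdiff(ids_xml, ids_biotab)    # IDs missing from the Biotab set

# TRUE when both formats yield exactly the same set of patients.
same_set <- setequal(ids_biotab, ids_xml)

# Overlap as a percentage of all distinct IDs seen in either format.
pct_overlap <- 100 * length(shared) / length(union(ids_biotab, ids_xml))

print(same_set)
print(pct_overlap)
```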