I am trying to use BRCA data downloaded using TCGAbiolinks to do a differential expression analysis.
I want to analyse the matched tumour-normal pairs but can't work out how to identify these cases among the samples I have.
Here is the code I have so far, including my attempt to subset the data:
query.BRCA.tumour <- GDCquery(project = "TCGA-BRCA",
                              legacy = TRUE,
                              data.category = "Gene expression",
                              data.type = "Gene expression quantification",
                              platform = "Illumina HiSeq",
                              experimental.strategy = "RNA-Seq",
                              sample.type = "Primary solid Tumor",
                              file.type = "results")
GDCdownload(query.BRCA.tumour, files.per.chunk = 200)
prep.BRCA.tumour <- GDCprepare(query = query.BRCA.tumour,
                               save = TRUE,
                               summarizedExperiment = TRUE,
                               save.filename = "BRCAtumour.rda")
query.BRCA.normal <- GDCquery(project = "TCGA-BRCA",
                              legacy = TRUE,
                              data.category = "Gene expression",
                              data.type = "Gene expression quantification",
                              platform = "Illumina HiSeq",
                              experimental.strategy = "RNA-Seq",
                              sample.type = "Solid Tissue Normal",
                              file.type = "results")
GDCdownload(query.BRCA.normal, files.per.chunk = 200)
prep.BRCA.normal <- GDCprepare(query = query.BRCA.normal,
                               save = TRUE,
                               save.filename = "BRCAnormal.rda",
                               summarizedExperiment = TRUE)
Matched.Samples.Normal <- subset(prep.BRCA.normal,
                                 select = colData(prep.BRCA.normal)$patient
                                   %in% colData(prep.BRCA.tumour)$patient)
Matched.Samples.Tumour <- subset(prep.BRCA.tumour,
                                 select = colData(prep.BRCA.normal)$patient
                                   %in% colData(prep.BRCA.tumour)$patient)
The download works perfectly, and I have used the prep.BRCA objects for unmatched DEA without any trouble.
However, Matched.Samples.Normal and Matched.Samples.Tumour come back as RangedSummarizedExperiment objects with the same number of samples as the original prep.BRCA.normal and prep.BRCA.tumour, rather than the 112 matched pairs that I know are available.
Can anyone shed some light on why it isn't working and suggest a solution?
Thank you.
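For reference, the tumour subset above reuses the normal-side test (colData(prep.BRCA.normal)$patient %in% colData(prep.BRCA.tumour)$patient), so its logical vector is built from the normal object and then applied to the tumour object. Below is a minimal base-R sketch of the intended matching. The patient IDs are made up for illustration and stand in for the real colData(...)$patient columns; the variable names are hypothetical too.

```r
# Hypothetical patient IDs standing in for colData(...)$patient
tumour_patients <- c("TCGA-A1-0001", "TCGA-A1-0002", "TCGA-A1-0003", "TCGA-A1-0004")
normal_patients <- c("TCGA-A1-0002", "TCGA-A1-0004")

# Each logical mask is built from the object it will be applied to
keep_tumour <- tumour_patients %in% normal_patients
keep_normal <- normal_patients %in% tumour_patients

# Keep only the patients present on both sides
matched_tumour <- tumour_patients[keep_tumour]
matched_normal <- normal_patients[keep_normal]
```

With the real objects, the tumour side would presumably become something like prep.BRCA.tumour[, colData(prep.BRCA.tumour)$patient %in% colData(prep.BRCA.normal)$patient], with the normal side left as it is.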
Ah thank you so much, such a silly error!
When I used GDCDataCommons to identify files for matched-pair gene expression samples, it returned 112, but the corrected code with TCGAbiolinks finds 113, which is a bit confusing.
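I don't know the actual cause of the 112 versus 113 difference. One thing that might be worth checking is whether the two numbers count different things, for example samples versus patients, since a patient can contribute more than one sample. This is a rough base-R sketch of that check on invented IDs. With the real data, the vectors would come from colData(...)$patient.

```r
# Invented IDs: patient "P2" contributes two tumour samples
tumour_patients <- c("P1", "P2", "P2", "P3")
normal_patients <- c("P1", "P2", "P3")

# Tumour samples whose patient also has a normal sample
matched_tumour <- tumour_patients[tumour_patients %in% normal_patients]

# Count samples and distinct patients separately
n_samples  <- length(matched_tumour)
n_patients <- length(unique(matched_tumour))

# Patients that appear more than once on the tumour side
repeated <- unique(matched_tumour[duplicated(matched_tumour)])
```

If n_samples and n_patients differ on the real data, the entries in repeated would show which patients account for the gap.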