I am trying to do a differential expression analysis of genes in normal vs. breast cancer samples. For that purpose, I chose the GEO dataset GSE62944, as it contains 9264 tumour samples and 741 normal samples.
I look up the expression set with the following code:
    library(AnnotationHub)
    ah <- AnnotationHub()
    query(ah, "GSE62944")
What I see is:
    AnnotationHub with 1 record
    # snapshotDate(): 2016-03-09
    # names(): AH28855
    # $dataprovider: GEO
    # $species: Homo sapiens
    # $rdataclass: ExpressionSet
    # $title: RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas
    # $description: TCGA RNA-seq Rsubread-summarized raw count data for 7706 tumor samples, represented as an R / Bioconductor...
    # $taxonomyid: 9606
    # $genome: hg19
    # $sourcetype: tar.gz
    # $sourceurl: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
    # $sourcelastmodifieddate: NA
    # $sourcesize: NA
    # $tags: TCGA, RNA-seq, Expression, Count
    # retrieve record with 'object[["AH28855"]]'
Why does the title say 7706 tumor samples? Does this expression set contain normal samples at all? If so, how would I access them?
I subset the breast cancer patient samples using this code:

    tcga_data <- ah[["AH28855"]]
    brca_data <- tcga_data[, which(phenoData(tcga_data)$CancerType == "BRCA")]
How can I subset both breast cancer and normal samples from the entire dataset?
Is there a way to subset specific genes (i.e. rows) from the dataset?
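To make that last question concrete, here is a minimal sketch of the kind of subsetting I have in mind, on a toy ExpressionSet rather than the real data. The gene names, sample names, and the `genes_of_interest` vector are made up for illustration, and I am assuming that the usual `[rows, columns]` indexing of Biobase ExpressionSet objects is the right tool here.

```r
library(Biobase)

# Toy stand-in for the real object: 4 hypothetical genes x 3 hypothetical samples
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                                 c("s1", "s2", "s3")))
pheno <- data.frame(CancerType = c("BRCA", "LUAD", "BRCA"),
                    row.names = colnames(counts))
toy <- ExpressionSet(assayData = counts,
                     phenoData = AnnotatedDataFrame(pheno))

# Hypothetical genes I would like to keep; intersect() drops any
# names that are not present among the feature names
genes_of_interest <- c("geneA", "geneC")
keep <- intersect(genes_of_interest, featureNames(toy))

# Rows select genes, columns select samples, in a single call
toy_sub <- toy[keep, toy$CancerType == "BRCA"]
dim(exprs(toy_sub))
```

I do not know whether this is the recommended approach for AH28855, or how the normal samples would fit into it.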
Any help would be appreciated.