I have some gene expression data from a publicly available array dataset generated on the SurePrint G3 Human GE 8x60K Microarray platform. I am trying to annotate the Agilent probe IDs with Entrez IDs using biomaRt in R. However, it appears that several Agilent IDs are missing from biomaRt.
As I am not familiar with Agilent's technology, I am not sure whether this is due to the design of the array (e.g. custom probes), to missing data in biomaRt, or to an error in my code.
The link provides the table of Agilent probe IDs, along with the other gene identifiers used by the expression dataset.
## This is just loading in the table from the link above.
probe = read.delim("GPL15931-probe_annotation.txt", comment.char = '#')

library(biomaRt)
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

## Should return all the Agilent probe IDs in biomaRt
agilent = getBM(attributes = c('efg_agilent_sureprint_g3_ge_8x60k'), values = "*", mart = ensembl)
The first thing that strikes me is that, for a 60K array, only 31K probe IDs are returned. Am I missing something here with either the technology or the code?
When I check which probes match between the two datasets, the difference is 11K probes. Every probe ID returned by biomaRt is present in the GEO table, but 11K IDs in the GEO table are missing from biomaRt.
table(probe$ID %in% agilent$efg_agilent_sureprint_g3_ge_8x60k)
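To see whether the unmatched probes share anything in common, one option would be to pull them out with setdiff() and tabulate them by ID prefix. This is only a minimal sketch: the IDs below are made-up placeholders standing in for probe$ID and agilent$efg_agilent_sureprint_g3_ge_8x60k, and grouping by the "A_<number>_P" prefix is my own assumption about a useful way to slice them, not something taken from Agilent's documentation.

```r
## Toy stand-ins for probe$ID and agilent$efg_agilent_sureprint_g3_ge_8x60k
## (hypothetical values, for illustration only).
geo_ids  <- c("A_23_P100001", "A_23_P100011", "A_33_P3336686",
              "A_24_P100200", "DarkCorner")
mart_ids <- c("A_23_P100001", "A_23_P100011")

## IDs present in the GEO platform table but not returned by biomaRt
unmatched <- setdiff(geo_ids, mart_ids)

## Group the unmatched IDs by prefix: "A_<number>_P" where the ID has that
## shape, otherwise lump them together as "other"
prefix <- ifelse(grepl("^A_[0-9]+_P", unmatched),
                 sub("^(A_[0-9]+_P).*$", "\\1", unmatched),
                 "other")
prefix_counts <- table(prefix)
print(prefix_counts)
```

With the real data, geo_ids and mart_ids would be replaced by the two columns compared above. If the unmatched IDs cluster under one or two prefixes, that might suggest a particular class of probe is missing rather than a problem with the query.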
Are there any Agilent Bioconductor packages that might have more complete probe IDs and gene identifiers? Any other thoughts on how to work around this problem?