3.7 years ago
Kevin Blighe
Posted in response to "Annotation Affymetrix probesets to Gene symbols"; posting it here to make it easier for other users to find in future.
1, retrieve dataset from GEO
library(GEOquery)
gset <- getGEO("GSE133824", GSEMatrix = TRUE, AnnotGPL = FALSE)
if (length(gset) > 1) idx <- grep("GPL17586", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
head(rownames(gset))
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"
[6] "2827995_st"
2a, convert probe IDs via hta20transcriptcluster.db
require(hta20transcriptcluster.db)
mapping <- mapIds(
  hta20transcriptcluster.db,
  keys = rownames(gset),
  column = 'SYMBOL',
  keytype = 'PROBEID')
head(mapping,40)
head(mapping[!is.na(mapping)])
TC01000001.hg.1 TC01000003.hg.1 TC01000005.hg.1 TC01000007.hg.1 TC01000009.hg.1
"DDX11L1" "OR4F5" "LINC01001" "LINC01061" "OR4F29"
TC01000010.hg.1
"LOC101928626"
Verify alignment between input probe IDs and annotation output
all(names(mapping) == rownames(gset))
[1] TRUE
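Once the alignment check passes, the symbols can be placed next to the expression values row by row. The following is only a minimal sketch in base R with toy data: probe_ids, mapping and expr are hypothetical stand-ins for rownames(gset), the mapIds() result and the expression matrix (e.g. exprs(gset)), and the expression values are made up. Falling back to the probe ID where no symbol exists is one possible choice, not the only one.

```r
# Toy stand-ins (hypothetical values) for rownames(gset) and the mapIds() result
probe_ids <- c("TC01000001.hg.1", "TC01000002.hg.1", "TC01000003.hg.1")
mapping <- c("DDX11L1", NA, "OR4F5")
names(mapping) <- probe_ids

# Toy expression matrix with probes in rows (made-up numbers)
expr <- matrix(
  c(5.1, 4.8, 6.3, 6.0, 7.2, 7.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(probe_ids, c("sample1", "sample2")))

# Only proceed if the mapping is in the same order as the matrix rows
stopifnot(identical(names(mapping), rownames(expr)))

# Use the symbol where available, otherwise fall back to the probe ID
labels <- ifelse(is.na(mapping), names(mapping), mapping)

annotated <- data.frame(
  PROBEID = rownames(expr),
  SYMBOL = unname(mapping),
  LABEL = unname(labels),
  expr,
  row.names = NULL,
  stringsAsFactors = FALSE)
print(annotated)
```

Keeping both SYMBOL and LABEL means the unannotated probes stay identifiable without hiding the fact that they have no symbol.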
We can also create an annotation table with select(), but this will not strictly be aligned to the input probe IDs, because a probe ID that maps to several genes is returned as several rows:
mapping <- select(
  hta20transcriptcluster.db,
  keys = rownames(gset),
  columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL'),
  keytype = 'PROBEID')
head(mapping[!is.na(mapping$SYMBOL),],20)
             PROBEID       SYMBOL  ENTREZID         ENSEMBL
2996 TC01000001.hg.1      DDX11L1 100287102 ENSG00000223972
2997 TC01000001.hg.1      DDX11L9 100288486 ENSG00000248472
2999 TC01000003.hg.1        OR4F5     79501 ENSG00000186092
3001 TC01000005.hg.1    LINC01001 100133161            <NA>
3002 TC01000005.hg.1 LOC100132287 100132287            <NA>
3003 TC01000005.hg.1 LOC100132062 100132062            <NA>
3004 TC01000005.hg.1 LOC100133331 100133331            <NA>
3006 TC01000007.hg.1    LINC01061    401149            <NA>
3008 TC01000009.hg.1       OR4F29    729759 ENSG00000284733
3009 TC01000009.hg.1        OR4F3     26683 ENSG00000230178
3010 TC01000009.hg.1       OR4F16     81399 ENSG00000284662
3011 TC01000010.hg.1 LOC101928626 101928626 ENSG00000230021
3019 TC01000018.hg.1    LINC01128    643837 ENSG00000228794
3020 TC01000019.hg.1    LOC284600    284600            <NA>
3021 TC01000020.hg.1       SAMD11    148398 ENSG00000187634
3022 TC01000021.hg.1       KLHL17    339451 ENSG00000187961
3023 TC01000022.hg.1      PLEKHN1     84069 ENSG00000187583
3024 TC01000023.hg.1        ISG15      9636 ENSG00000187608
3025 TC01000024.hg.1         AGRN    375790 ENSG00000188157
3026 TC01000025.hg.1 LOC100288175 100288175 ENSG00000217801
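If a table with exactly one row per input probe is needed, the one-to-many rows can be collapsed and re-ordered to the input. This is a minimal base R sketch, not part of the procedure above: annot_long is a toy subset of the table shown, probe_ids is a hypothetical input order that includes one probe without annotation, and joining symbols with ';' is an arbitrary choice.

```r
# Toy subset of the select() output shown above
annot_long <- data.frame(
  PROBEID = c("TC01000001.hg.1", "TC01000001.hg.1", "TC01000003.hg.1",
              "TC01000009.hg.1", "TC01000009.hg.1", "TC01000009.hg.1"),
  SYMBOL = c("DDX11L1", "DDX11L9", "OR4F5", "OR4F29", "OR4F3", "OR4F16"),
  stringsAsFactors = FALSE)

# Hypothetical input order, including a probe that has no annotation
probe_ids <- c("TC01000009.hg.1", "TC01000002.hg.1",
               "TC01000001.hg.1", "TC01000003.hg.1")

# Drop rows without a symbol so they do not end up in the joined string
annot_long <- annot_long[!is.na(annot_long$SYMBOL), ]

# Join all unique symbols per probe with ';'
collapsed <- vapply(
  split(annot_long$SYMBOL, annot_long$PROBEID),
  function(x) paste(unique(x), collapse = ";"),
  character(1))

# Re-order to the input probes; probes absent from the table become NA
annot_wide <- data.frame(
  PROBEID = probe_ids,
  SYMBOL = unname(collapsed[match(probe_ids, names(collapsed))]),
  stringsAsFactors = FALSE)
print(annot_wide)
```

After this step, all(annot_wide$PROBEID == probe_ids) holds by construction, which is the same kind of alignment check used for the mapIds() result.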
2b, via biomaRt
You can also use biomaRt and generate an annotation table for this array via:
require(biomaRt)
ensembl <- useMart(
  'ensembl',
  dataset = 'hsapiens_gene_ensembl')
annot <- getBM(
  attributes = c(
    'affy_hta_2_0',
    'hgnc_symbol',
    'ensembl_gene_id',
    'entrezgene_id',
    'gene_biotype'),
  mart = ensembl)
head(annot)
   affy_hta_2_0 hgnc_symbol ensembl_gene_id entrezgene_id gene_biotype
1 TC0M000002.hg       MT-TF ENSG00000210049            NA      Mt_tRNA
2 TC07000959.hg       MT-TF ENSG00000210049            NA      Mt_tRNA
3 TC11001412.hg       MT-TF ENSG00000210049            NA      Mt_tRNA
4 TC0M000002.hg     MT-RNR1 ENSG00000211459            NA      Mt_rRNA
5 TC07000959.hg     MT-RNR1 ENSG00000211459            NA      Mt_rRNA
6 TC11001412.hg     MT-RNR1 ENSG00000211459            NA      Mt_rRNA
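Note that the IDs in this table end in '.hg', whereas the probe IDs from hta20transcriptcluster.db shown earlier end in '.hg.1', so the two will not match directly. The following is a minimal base R sketch of one way to line them up. It assumes that the trailing '.1' is the only difference between the two formats, which should be checked on the real data; the annot and probe_ids objects here are toy stand-ins with illustrative values.

```r
# Toy stand-in for the getBM() table (IDs end in '.hg'); values are illustrative
annot <- data.frame(
  affy_hta_2_0 = c("TC0M000002.hg", "TC07000959.hg", "TC01000001.hg"),
  hgnc_symbol = c("MT-TF", "MT-TF", "DDX11L1"),
  stringsAsFactors = FALSE)

# Toy stand-in for rownames(gset) (transcript cluster IDs end in '.hg.1')
probe_ids <- c("TC01000001.hg.1", "TC0M000002.hg.1", "2824546_st")

# Assumption: dropping the trailing '.1' makes the two ID formats comparable
probe_key <- sub("\\.hg\\.1$", ".hg", probe_ids)

# match() returns the first hit only, so one-to-many mappings are reduced
idx <- match(probe_key, annot$affy_hta_2_0)

matched <- data.frame(
  PROBEID = probe_ids,
  hgnc_symbol = annot$hgnc_symbol[idx],
  stringsAsFactors = FALSE)
print(matched)
```

Because match() keeps only the first matching row, a probe that biomaRt lists against several genes would need the collapsing approach instead if all of its genes should be retained.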
Kevin
Thank you for this resource!
I'm trying to analyze a microarray experiment I found on NCBI. According to the NCBI info (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76515), the run was performed using
[HTA-2_0] Affymetrix Human Transcriptome Array 2.0
I tried your procedure above with hta20transcriptcluster.db.
This successfully mapped just over half of the probes (~44,000 out of ~70,000) to ENSEMBL gene IDs. So...any idea what's going on with the unmapped IDs in this case? Does this just mean they are unmappable or do I need to cross-reference a different database?
The unmapped probe IDs are like 2824546_st and TC0X002061.hg.1
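A quick way to see which ID styles account for the unmapped probes is to cross-tabulate the ID pattern against whether a mapping was found. This is only a sketch in base R with toy data: probe_ids and ensembl are hypothetical stand-ins for the array's probe IDs and the ENSEMBL column returned by the annotation step, and the two patterns are simply the ones visible in the examples above.

```r
# Toy stand-ins: probe IDs and their ENSEMBL mapping (NA = unmapped)
probe_ids <- c("2824546_st", "2824549_st", "TC0X002061.hg.1",
               "TC01000001.hg.1", "TC01000003.hg.1")
ensembl <- c(NA, NA, NA, "ENSG00000223972", "ENSG00000186092")

# Classify each ID by its visible pattern
id_style <- ifelse(grepl("_st$", probe_ids), "_st",
            ifelse(grepl("^TC", probe_ids), "TC", "other"))

# Rows: ID style; columns: whether an ENSEMBL ID was found
tab <- table(id_style, mapped = !is.na(ensembl))
print(tab)
```

On the real data, a table like this shows whether the unmapped IDs are concentrated in one ID style or spread across both.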