Tutorial: Affymetrix HTA 2.0 ID conversion

3.0 years ago

Kevin Blighe 88k

Posted in response to "Annotation Affymetrix probesets to Gene symbols". Posting here to ensure greater future accessibility to other users.

1, retrieve dataset from GEO

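library(GEOquery)
# Download the series as a list of ExpressionSet objects (one per platform)
gset <- getGEO("GSE133824", GSEMatrix = TRUE, AnnotGPL = FALSE)
# If the series spans several platforms, keep the GPL17586 entry
if (length(gset) > 1) idx <- grep("GPL17586", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]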

head(rownames(gset))
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"
[6] "2827995_st"

2a, convert probe IDs via `hta20transcriptcluster.db`

library(hta20transcriptcluster.db)
mapping <- mapIds(
  hta20transcriptcluster.db,
  keys = rownames(gset),
  column = 'SYMBOL',
  keytype = 'PROBEID')

head(mapping,40)
head(mapping[!is.na(mapping)])
TC01000001.hg.1 TC01000003.hg.1 TC01000005.hg.1 TC01000007.hg.1 TC01000009.hg.1 
      "DDX11L1"         "OR4F5"     "LINC01001"     "LINC01061"        "OR4F29" 
TC01000010.hg.1 
 "LOC101928626"

Verify alignment between input probe IDs and annotation output

all(names(mapping) == rownames(gset))
[1] TRUE
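
Once the order is confirmed, the symbols can be placed next to the expression values. The following is only a minimal sketch on toy data: `expr` stands in for the expression matrix of `gset` and `mapping` for the `mapIds()` result above, and the values are made up for illustration.

```r
# Toy stand-in for the expression matrix (rows = probes, columns = samples)
expr <- matrix(
  c(5.1, 6.2, 7.3, 5.4, 6.5, 7.6),
  nrow = 3,
  dimnames = list(
    c('2824546_st', 'TC01000001.hg.1', 'TC01000003.hg.1'),
    c('sample1', 'sample2')))

# Toy stand-in for the named vector returned by mapIds(); NA means unmapped
mapping <- c(
  '2824546_st' = NA,
  'TC01000001.hg.1' = 'DDX11L1',
  'TC01000003.hg.1' = 'OR4F5')

# Refuse to continue if the two objects are not in the same order
stopifnot(identical(names(mapping), rownames(expr)))

# Bind the symbols to the expression values, keeping probe IDs as row names
annotated <- data.frame(
  SYMBOL = unname(mapping),
  expr,
  row.names = rownames(expr),
  check.names = FALSE)
annotated
```

The explicit `stopifnot()` check is the important part: binding by position is only safe when the names agree.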

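We can also create an annotation table, but it will not be strictly aligned to the input probe IDs: `select()` returns one row per probe-to-gene match, so a probe that maps to several genes appears on several rows.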

mapping <- select(
  hta20transcriptcluster.db,
  keys = rownames(gset),
  columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL'),
  keytype = 'PROBEID')

head(mapping[!is.na(mapping$SYMBOL),],20)
             PROBEID       SYMBOL  ENTREZID         ENSEMBL
2996 TC01000001.hg.1      DDX11L1 100287102 ENSG00000223972
2997 TC01000001.hg.1      DDX11L9 100288486 ENSG00000248472
2999 TC01000003.hg.1        OR4F5     79501 ENSG00000186092
3001 TC01000005.hg.1    LINC01001 100133161            <NA>
3002 TC01000005.hg.1 LOC100132287 100132287            <NA>
3003 TC01000005.hg.1 LOC100132062 100132062            <NA>
3004 TC01000005.hg.1 LOC100133331 100133331            <NA>
3006 TC01000007.hg.1    LINC01061    401149            <NA>
3008 TC01000009.hg.1       OR4F29    729759 ENSG00000284733
3009 TC01000009.hg.1        OR4F3     26683 ENSG00000230178
3010 TC01000009.hg.1       OR4F16     81399 ENSG00000284662
3011 TC01000010.hg.1 LOC101928626 101928626 ENSG00000230021
3019 TC01000018.hg.1    LINC01128    643837 ENSG00000228794
3020 TC01000019.hg.1    LOC284600    284600            <NA>
3021 TC01000020.hg.1       SAMD11    148398 ENSG00000187634
3022 TC01000021.hg.1       KLHL17    339451 ENSG00000187961
3023 TC01000022.hg.1      PLEKHN1     84069 ENSG00000187583
3024 TC01000023.hg.1        ISG15      9636 ENSG00000187608
3025 TC01000024.hg.1         AGRN    375790 ENSG00000188157
3026 TC01000025.hg.1 LOC100288175 100288175 ENSG00000217801
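
If a table aligned to the input is needed, one option is to collapse the `select()` output to a single entry per probe and then reorder it against the original IDs. Below is a minimal sketch in base R on toy data: `annot_tab` and `probe_ids` are hypothetical stand-ins for the `select()` output and `rownames(gset)`, and the `;` separator is an arbitrary choice.

```r
# Toy stand-in for the select() output above: one row per probe-to-gene match
annot_tab <- data.frame(
  PROBEID = c('TC01000001.hg.1', 'TC01000001.hg.1', 'TC01000003.hg.1'),
  SYMBOL = c('DDX11L1', 'DDX11L9', 'OR4F5'),
  stringsAsFactors = FALSE)

# Toy stand-in for rownames(gset); the first ID has no annotation
probe_ids <- c('2824546_st', 'TC01000001.hg.1', 'TC01000003.hg.1')

# Collapse multiple symbols per probe into one ';'-separated string
collapsed <- aggregate(
  SYMBOL ~ PROBEID,
  data = annot_tab,
  FUN = function(x) paste(unique(x), collapse = ';'))

# Reorder to match the input IDs; probes without a symbol become NA
aligned <- collapsed$SYMBOL[match(probe_ids, collapsed$PROBEID)]
names(aligned) <- probe_ids
aligned
```

Whether to concatenate symbols, keep only the first, or drop multi-mapping probes altogether depends on the downstream analysis.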

2b, via `biomaRt`

You can also use biomaRt and generate an annotation table for this array via:

library(biomaRt)
ensembl <- useMart(
  'ensembl',
  dataset = 'hsapiens_gene_ensembl')
annot <- getBM(
  attributes = c(
    'affy_hta_2_0',
    'hgnc_symbol',
    'ensembl_gene_id',
    'entrezgene_id',
    'gene_biotype'),
  mart = ensembl)

head(annot)
   affy_hta_2_0 hgnc_symbol ensembl_gene_id entrezgene_id gene_biotype
1 TC0M000002.hg       MT-TF ENSG00000210049            NA      Mt_tRNA
2 TC07000959.hg       MT-TF ENSG00000210049            NA      Mt_tRNA
3 TC11001412.hg       MT-TF ENSG00000210049            NA      Mt_tRNA
4 TC0M000002.hg     MT-RNR1 ENSG00000211459            NA      Mt_rRNA
5 TC07000959.hg     MT-RNR1 ENSG00000211459            NA      Mt_rRNA
6 TC11001412.hg     MT-RNR1 ENSG00000211459            NA      Mt_rRNA
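
Judging from the two outputs shown in this post, the biomaRt IDs (e.g. `TC0M000002.hg`) lack the trailing `.1` seen in the GEO row names (e.g. `TC01000001.hg.1`), so the two sets may not match directly. Below is a minimal sketch of one way to reconcile them, on toy data; `probe_ids` and the two-row `annot` are hypothetical stand-ins, and the assumption that only a `.1` suffix differs should be checked against the real data.

```r
# Toy stand-ins: IDs as in rownames(gset) and a table shaped like getBM() output
probe_ids <- c('TC0M000002.hg.1', 'TC07000959.hg.1', '2824546_st')
annot <- data.frame(
  affy_hta_2_0 = c('TC0M000002.hg', 'TC07000959.hg'),
  hgnc_symbol = c('MT-TF', 'MT-TF'),
  stringsAsFactors = FALSE)

# Strip a trailing '.1' so both ID sets share one format (assumed difference)
stripped <- sub('\\.1$', '', probe_ids)

# Take the first matching symbol for each probe; unmatched IDs give NA
symbols <- annot$hgnc_symbol[match(stripped, annot$affy_hta_2_0)]
names(symbols) <- probe_ids
symbols
```

Note that `match()` keeps only the first hit per probe, while the biomaRt table above lists several genes for the same probe ID.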

Kevin

Thank you for this resource!

I'm trying to analyze a microarray experiment I found on NCBI. According to the NCBI record (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76515), the run was performed using the [HTA-2_0] Affymetrix Human Transcriptome Array 2.0.

I tried your procedure above with hta20transcriptcluster.db.

This successfully mapped just over half of the probes (~44,000 out of ~70,000) to ENSEMBL gene IDs. So...any idea what's going on with the unmapped IDs in this case? Does this just mean they are unmappable or do I need to cross-reference a different database?

The unmapped probe IDs look like 2824546_st and TC0X002061.hg.1.

Reply by MaxF ▴ 120, 5 months ago
