Microarray Probes to Ensembl ID or Gene ID - clariomdhumanprobeset.db
5.1 years ago
Scott McKay ▴ 30

I am trying to pull DEG lists from multiple GEO datasets to cross-analyze them. Is there a way (in either R or Python 3) to convert the probe IDs to something more universal, such as Ensembl ID, HGNC ID, or Gene ID? Please let me know. Thanks!

R python microarray probe gene • 3.0k views

You can try two things (assuming your dataset used Affymetrix Human Genome U133 Plus 2.0 Array):

Use biomaRt

library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
probeids=c('200007_at', '200011_s_at', '200012_x_at')
getBM(attributes=c('affy_hg_u133_plus_2', 'hgnc_symbol'), 
      filters = 'affy_hg_u133_plus_2', 
      values = probeids, 
      mart = ensembl)

Use GEOquery

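library(GEOquery)
# GSE_id is a placeholder for your own GEO series accession
gse <- getGEO(GSE_id, GSEMatrix = TRUE)
# platform annotation table attached to the first ExpressionSet
featureData <- Biobase::fData(gse[[1]])
# columns 1 and 11 are meant to be probe ID and gene symbol; verify with colnames(featureData)
ID_mapping <- featureData[, c(1, 11)]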
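Column positions such as c(1, 11) differ from one GEO platform to the next, so selecting by column name is usually less fragile. A minimal base R sketch, using a small made-up table in place of the real feature data. The column names `ID` and `Gene Symbol` are assumptions here (they follow what GPL570 annotation tables typically use), and the symbol values are placeholders, so check `colnames()` on your own table first:

```r
# Toy stand-in for the feature data table returned by GEOquery
# (accessions and symbols are illustrative placeholders, not real annotation)
featureData <- data.frame(
  ID = c("200007_at", "200011_s_at", "200012_x_at"),
  GB_ACC = c("acc1", "acc2", "acc3"),
  `Gene Symbol` = c("SYM1", "SYM2", "SYM3"),
  check.names = FALSE,       # keep the space in "Gene Symbol"
  stringsAsFactors = FALSE
)

# Locate the symbol column by name instead of hard-coding its position
symbol_col <- grep("symbol", colnames(featureData),
                   ignore.case = TRUE, value = TRUE)[1]

# Two-column probe-to-symbol lookup table
ID_mapping <- featureData[, c("ID", symbol_col)]
print(ID_mapping)
```

If `grep()` finds no matching column, `symbol_col` is NA and the subsetting fails, which is a useful signal that the platform table uses different column names.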

What should I do if the array is not in biomaRt?


Which array is it? Try the manufacturer's website for the annotation. Also look at the Bioconductor annotation packages: https://www.bioconductor.org/packages/release/data/annotation/


It's from the Affymetrix Clariom D Assay.


Would I just download the annotation package and then run the same script as above and just swap the attribute and filter?


Yes, I posted a solution below for that package.


Yes, but you need to know the array type that you are using. Take a look at this example for Affymetrix U133 Plus 2.0: A: Affymetrix Human Genome U133 Plus 2.0 Array

5.1 years ago

Response from this comment (above): C: Microarray Probes to Ensembl ID or Gene ID

Oh yes, you just need to use a simple lookup:

# install package (large; > 400MB)
BiocManager::install("clariomdhumanprobeset.db")

# load package (library() stops with an error if the package is missing)
library(clariomdhumanprobeset.db)

# store the probe names (probably rownames of your expression object)
IDs <- c("PSR1700192228.hg.1","PSR1700192231.hg.1","PSR2000155490.hg.1",
  "JUC2000052683.hg.1","PSR0800175519.hg.1","JUC0800062325.hg.1")

# look up the probes
mapIds(
  clariomdhumanprobeset.db,
  keys = IDs,
  column = 'SYMBOL',
  keytype = 'PROBEID')

'select()' returned 1:1 mapping between keys and columns
PSR1700192228.hg.1 PSR1700192231.hg.1 PSR2000155490.hg.1 JUC2000052683.hg.1 
           "CD79B"            "CD79B"             "CDH4"             "CDH4" 
PSR0800175519.hg.1 JUC0800062325.hg.1 
         "RUNX1T1"          "RUNX1T1"

To see which other identifier types the package supports (for example ENSEMBL or ENTREZID), run:

keytypes(clariomdhumanprobeset.db)

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
[21] "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIGENE"      "UNIPROT"
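Since the original question asked for Ensembl or Gene (Entrez) IDs rather than symbols, the same lookup can be pointed at those columns. The sketch below wraps `AnnotationDbi::select()`, which returns a data frame and may give several rows per probe when a probe maps to more than one ID. `annotate_probes` is a hypothetical helper name introduced here for illustration, and the example call only runs if the annotation package is installed:

```r
# Hypothetical helper (not part of any package): fetch several ID types at once
annotate_probes <- function(db, ids,
                            columns = c("SYMBOL", "ENSEMBL", "ENTREZID")) {
  # select() returns one row per probe/ID combination, so probes that map
  # to several Ensembl or Entrez IDs appear on several rows
  AnnotationDbi::select(db, keys = ids, columns = columns,
                        keytype = "PROBEID")
}

# Only attempt the lookup when the (large) annotation package is available
if (requireNamespace("clariomdhumanprobeset.db", quietly = TRUE)) {
  library(clariomdhumanprobeset.db)
  IDs <- c("PSR1700192228.hg.1", "JUC2000052683.hg.1")
  print(annotate_probes(clariomdhumanprobeset.db, IDs))
}
```

For a single ID type, `mapIds()` with `column = 'ENSEMBL'` or `column = 'ENTREZID'` works the same way as the SYMBOL lookup shown earlier.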

There is also an example in Section 17.4.4 (Gene annotation) of the limma User's Guide.

Kevin


Sorry Kevin

I have Rosetta probe identifiers.

I have an Agilent microarray gene expression matrix like this, with unfamiliar IDs as the row names:

> head(mat[1:10,1:5])
            GSM482796 GSM482797 GSM482798 GSM482799 GSM482800
10019475365     0.243    0.0176    0.1200    0.0994    0.0782
10019481149     0.504    0.1700    0.2690    0.2640    0.2070
10019495284     0.247    0.0300    0.0993    0.0113    0.1440
10019687586     0.148   -0.0542   -0.0408   -0.0072   -0.0924
10019713746     0.953    0.3400    0.6800    0.2300    0.5640
10019799479     0.672    0.2130    0.2470    0.1610    0.4050
>

> dim(mat)
[1] 39302    76
>

There is a matching gene symbol for each of these identifiers in another matrix:

> head(matched)
          Gene.symbol
174996658      USHBP1
174996659      USHBP1
174996660      USHBP1
174996661      USHBP1
174996662      USHBP1
174996663      USHBP1
> 

> dim(matched)
[1] 23107     1
>

How can I attach the matching gene symbols to the probe identifiers in the row names of my expression matrix? The problem is that one gene symbol can have several probe identifiers. For instance, USHBP1 has 174996658, 174996659, 174996660, 174996661, 174996662, and 174996663. So I really don't know what to do now.

I tried

> merged <- merge(mat, matched) 
Error: cannot allocate vector of size 6.8 Gb

Hey, you can use limma::avereps to summarise expression over the probes that target the same gene.
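The `merge()` call above runs out of memory because `mat` and `matched` share no column names, and in that case `merge()` builds the Cartesian product of the two tables (39302 × 23107 rows). Matching on row names avoids this. A minimal base R sketch on made-up toy data: it looks up each probe's symbol with `match()`, drops probes that have no symbol, and averages probes sharing a symbol, which is the same idea as `limma::avereps(mat, ID = symbols)`. All object names and values below are illustrative assumptions:

```r
# Toy expression matrix: rows are probe IDs, columns are samples (values made up)
mat <- matrix(
  c(1, 2, 3, 4,        # sample s1
    3, 4, 5, 6,        # sample s2
    10, 20, 30, 40),   # sample s3
  nrow = 4,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2", "s3"))
)

# Toy annotation: probe IDs as row names, one symbol column; p3 has no entry
matched <- data.frame(
  Gene.symbol = c("GENEA", "GENEA", "GENEB"),
  row.names = c("p1", "p2", "p4"),
  stringsAsFactors = FALSE
)

# Look up the symbol for every row of mat; unmatched probes become NA
symbols <- matched$Gene.symbol[match(rownames(mat), rownames(matched))]

# Keep only probes that have a non-empty symbol
keep <- !is.na(symbols) & symbols != ""
mat_keep <- mat[keep, , drop = FALSE]
sym_keep <- symbols[keep]

# Average probes per symbol: per-group column sums divided by group sizes
sums <- rowsum(mat_keep, group = sym_keep)
counts <- rowsum(rep(1, length(sym_keep)), group = sym_keep)
gene_mat <- sums / counts[, 1]
print(gene_mat)
```

With limma installed, the last three statements could be replaced by `limma::avereps(mat_keep, ID = sym_keep)`. Whether averaging is the right summary (rather than, say, keeping the most variable probe per gene) depends on the analysis.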
