Question

Affymetrix: probeset or transcript cluster?

1

3.6 years ago

salamandra ▴ 550

I used to first normalize Affymetrix microarray data with RMA by 'probeset':

oligo::rma(rawData, background=TRUE, normalize=TRUE, target="probeset")

and then convert probe ids to gene ids with:

select(microarrayPackage, keys = as.character(ids), columns = c('PROBEID','ENSEMBL'), keytype = 'PROBEID')

But now some annotation packages that used to be '.db' have been replaced by 'transcriptcluster.db' and 'probeset.db'. Can I run exactly the same code using 'probeset.db', or should I use 'transcriptcluster.db'?

I know there is a post on this, but I still can't tell which one I should use when running the code above.

Affymetrix microarray oligo R • 2.4k views

score 2 · Answer 1 · 2020-09-04

2

3.6 years ago

Kevin Blighe 87k

Hi, it is important to know which array you are using. Based on experience, it is an 'ST' array, likely the 1.0 or 1.1 Mo- or Hu-Gene.

Also, can you post some of the IDs?

The transcript cluster DB is generally 'better' (ask me how to define better in a short sentence and I would not be able to do it).

0

I am doing this for studies on GEO, so the array platform varies.

Examples:

GSE75918: the array platform is [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]. It works with both transcriptcluster and probeset, but with both there are many NAs. Example IDs: "7893430" "7893431" "7893432" "7893433" "7893434"

GSE63296: array [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]. Example IDs: "47419722_st" "47419725_st" "47419729_st" "47419731_st"

GSE66529: array [HuGene-2_0-st] Affymetrix Human Gene 2.0 ST Array [transcript (gene) version]. Example IDs: "16651727" "16651729" "16651731" "16651733" "16651735"

And for this last one I can't even find the package on Bioconductor:

GSE55487: array [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]

ADD REPLY • link 3.6 years ago by salamandra ▴ 550

0

Hey, thanks for sharing that. I would use the following packages for these:

GSE75918: the array platform is [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]. It works with both transcriptcluster and probeset, but with both there are many NAs. Example IDs: "7893430" "7893431" "7893432" "7893433" "7893434"

hugene10sttranscriptcluster.db

----------

GSE63296: array [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]. Example IDs: "47419722_st" "47419725_st" "47419729_st" "47419731_st"

hta20transcriptcluster.db

----------

GSE66529: array [HuGene-2_0-st] Affymetrix Human Gene 2.0 ST Array [transcript (gene) version]

hugene20sttranscriptcluster.db

----------

GSE55487: array [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]

huex10sttranscriptcluster.db

----------

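A one-to-one mapping should be achievable via, for example:

library(hugene20sttranscriptcluster.db)

# row names of the RMA-summarised object are the array IDs to look up
probes <- rownames(eset)
samIDs <- colnames(eset)
# look up Ensembl gene IDs and gene symbols for each array ID
annotLookup <- select(hugene20sttranscriptcluster.db, keys = probes,
  columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'), keytype = 'PROBEID')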
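Note that select() can return more than one row per ID when an ID maps to several genes, so the lookup table may need collapsing before it is truly one-to-one. A minimal base R sketch of one way to do that is shown here; the toy data frame and its gene names are invented for illustration and only stand in for real select() output, and keeping the first match per ID is an assumption, not the only reasonable rule.

```r
# Toy stand-in for the data frame returned by select(); values are illustrative only
annotLookup <- data.frame(
  PROBEID = c("16651727", "16651727", "16651729", "16651731"),
  ENSEMBL = c("ENSG_A", "ENSG_B", NA, "ENSG_C"),
  SYMBOL  = c("GENE_A", "GENE_B", NA, "GENE_C"),
  stringsAsFactors = FALSE
)

# Keep only the first annotation row for each array ID (assumed tie-breaking rule)
oneToOne <- annotLookup[!duplicated(annotLookup$PROBEID), ]

# Re-order the lookup so it lines up with the rows of the expression matrix;
# 'probes' here is a toy version of rownames(eset)
probes <- c("16651727", "16651729", "16651731")
oneToOne <- oneToOne[match(probes, oneToOne$PROBEID), ]
```

After this step, oneToOne has exactly one row per entry in probes, in the same order, with NA where no annotation was found.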

ADD REPLY • link 3.6 years ago by Kevin Blighe 87k

0

Thank you for answering. In the meantime I tested how many IDs were converted successfully with both probeset and transcriptcluster, using:

sum(!is.na(resultTable$ENSEMBL))

1st example - GSE75918: hugene10stprobeset.db: 327424 ids; hugene10sttranscriptcluster.db: 215 ids

2nd example - GSE63296: hta20probeset.db: 0 ids; hta20transcriptcluster.db: 40567 ids

3rd example - GSE66529: hugene20stprobeset.db: 376741 ids; hugene20sttranscriptcluster.db: 0 ids

So in the 1st and 3rd examples, it actually seems to be probeset that converts more IDs...
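One caveat when reading such counts: sum(!is.na(...)) counts rows of the lookup table, and because select() can return several rows for a single ID, the row count may differ from the number of distinct IDs that were annotated. The sketch below uses an invented toy table in place of a real resultTable to show both counts side by side.

```r
# Toy stand-in for the select() result; IDs and gene names are illustrative only
resultTable <- data.frame(
  PROBEID = c("7893430", "7893430", "7893431", "7893432"),
  ENSEMBL = c("ENSG_A", "ENSG_B", NA, "ENSG_C"),
  stringsAsFactors = FALSE
)

# Number of rows carrying an Ensembl ID (what the original one-liner reports)
rowsWithHit <- sum(!is.na(resultTable$ENSEMBL))

# Number of distinct array IDs with at least one Ensembl ID
mappedIDs <- unique(resultTable$PROBEID[!is.na(resultTable$ENSEMBL)])
idsWithHit <- length(mappedIDs)

# Fraction of distinct array IDs that were annotated
fractionMapped <- idsWithHit / length(unique(resultTable$PROBEID))
```

Comparing the per-ID fraction across annotation packages is likely a fairer test than comparing raw row counts.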

ADD REPLY • link 3.6 years ago by salamandra ▴ 550

1

It will depend on how you are summarising the data during RMA normalisation. You seem to be summarising at the level of the probe set, so the probeset annotation will be needed.

Unless you have good justification, you should be using the transcriptcluster.db package for your array (XYZtranscriptcluster.db) with:

oligo::rma(rawData,
  background = TRUE,
  normalize = TRUE,
  target = 'core')

Please see the difference here: Human Exon array probeset to gene-level expression
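To make the pairing explicit, here is a small sketch that derives the annotation package name from the summarisation target. The function annotationPackageFor() is hypothetical (it is not part of oligo or AnnotationDbi), it only covers the two targets discussed in this thread, and it assumes the package naming pattern seen in the examples above.

```r
# Hypothetical helper: choose the annotation package that matches the 'target'
# passed to oligo::rma(). 'prefix' is the platform part of the package name,
# e.g. 'hugene10st', 'hugene20st' or 'hta20'.
annotationPackageFor <- function(prefix, target = c("core", "probeset")) {
  target <- match.arg(target)
  # 'core' summarises to transcript clusters (gene level);
  # 'probeset' keeps probe set level values
  level <- if (target == "core") "transcriptcluster" else "probeset"
  paste0(prefix, level, ".db")
}

pkgCore <- annotationPackageFor("hugene10st", "core")
pkgProbeset <- annotationPackageFor("hugene10st", "probeset")
```

With target = 'core' the IDs in the expression matrix are transcript cluster IDs, so they should be looked up in the transcriptcluster package; with target = 'probeset' they should be looked up in the probeset package.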

ADD REPLY • link 3.6 years ago by Kevin Blighe 87k

0

Doing it the way you suggest for the 1st example gives only 215 successfully converted IDs out of a total of 33297.

ADD REPLY • link 3.6 years ago by salamandra ▴ 550

0

Ah ok, I was doing it wrong. Thank you!

ADD REPLY • link 3.6 years ago by salamandra ▴ 550

1

All good then? / Tutto bene? / Todo bien? / Tudo bem?

ADD REPLY • link 3.6 years ago by Kevin Blighe 87k