Human Exon array probeset to gene-level expression
1
4
Entering edit mode
3.9 years ago
mforde84 ★ 1.3k

Hi,

I'm working with some Affy HuEx arrays.

I'm able to do all of the necessary backgrounding, normalization, and summarization steps for the arrays, however I'm a little stuck on what's a reasonable way to collapse transcript level probesets to gene-level estimates. Should I do something as simple as aggregate by mean / median? Or is there a more robust way to estimate general abundance of a gene from these arrays?

Marty

affy microarray • 2.5k views
0
Entering edit mode

So anyone have any ideas?

5
Entering edit mode
3.9 years ago

Hey Marty,

I've analysed data from this chip recently (2016). Are you using the limma and oligo packages in R in order to process these?

When you get to the rma() function, you can specify there how you want to summarise the expression values:

## Summarised per gene

rma(project, background=TRUE, normalize=TRUE, target="core")


If your array is an 'Exon' array, this may still only summarise to exon-level, requiring you to perform a further [manual] summarisation to gene-level. There are also other options available specifically for these 'Exon' arrays:

• target = ’full'
• target = ’extended’

## Summarised per probe set:

rma(project, background=TRUE, normalize=TRUE, target="probeset")


Probe sets target different things, depending on the array type.

## ------------------------------------

NB - the behaviour of target will greatly depend on the array type!

Kevin

1
Entering edit mode

This is of great help. If target isn't provided RMA by default normalizes, summarizes based on gene-level. ?rma doesn't provide any information either on this. Thanks.

0
Entering edit mode

Yes this is one of fundamental and critical things that is surprisingly not mentioned in many of the tutorials! I cannot remember how I figured it out, but it was a long time ago.

1
Entering edit mode

Yes, I ran analyses with target as "core" and without anything assuming without target RMA will do probeset. I looked at the row counts for both results and everything was same. wasted many hours to make sure I did everything OK before stumbling onto this thread. :)

0
Entering edit mode

thanks kevin. what would you suggest for collapsing multimapped probesets? aggregate by mean / median, or something different?

1
Entering edit mode

I may have missed exactly what you mean by 'multimapped probesets'? During RMA normalisation, a Tukey's 'median polish' is applied using information from all probesets. Are you just referring to summarising transcript isoforms into a single expression value for each gene? - in the past, I have always used the mean in this case and results have been as expected.

0
Entering edit mode

my mistake, but yes just summarizing transcripts to gene level. i checked both the results for core and probeset summarization options. while im using SCAN.UPC instead, they are both appear to be generating estimates for transcripts. So to convert to entrez id, we simply select using key types matching in the transcriptcluster.db, then aggregate by entrez id to get gene level estimates.

thanks for the help.

0
Entering edit mode

I am trying to summarize on gene level with target=core, but get this error: Error in .local(object, ...) : unused argument (target = "core"), even when specifying oligo::rma( Do you know why is this happens?

0
Entering edit mode

Which array are you using? are you specifying:

rma(..., target = core)


...or

rma(..., target = 'core')


???

0
Entering edit mode

I tried both: array is Affymetrix Human Genome U133 Plus 2.0 Array from this study: GSE28799 Code is:

library(oligo)
library(GEOquery)
platform = 'HG-U133_Plus_2' # specific affymetrix microarray platform
studyId <- 'study10'
GEOId <- 'GSE28799'# GEO Id of the study/series
GPLId <- 'GPL570'

treatment1 <- 'CNSCs'
treatment2 <- 'CSCs'

treatment1Samples <- c('GSM713222', 'GSM713223', 'GSM713224')
treatment2Samples <- c('GSM713227', 'GSM713228', 'GSM713229')
sampleNames <- c(treatment1Samples, treatment2Samples)
studyDir <- paste0(dataDir,studyId) # path of directory of specific study
dir.create(file.path(studyDir), suppressWarnings('already exists')) # creates a new directory for study
getGEOSuppFiles(GEOId, baseDir = studyDir) # get .tar file with study raw data (it's a supplementary file at GEO page)
tarfile <- paste0(studyDir, '/', GEOId,'/',GEOId,'_RAW.tar') # path to tarfile
FilesInside <- untar(tarfile, list = T) # list file names inside tarfile,have to set list = T to only list them
CELfiles <- grep('CEL', FilesInside, value = T) # select names of CEL files only.
SampleCELfiles <- CELfiles[grepl(paste(sampleNames,collapse='|'), CELfiles)] # select names of CEL files of the samples we want to analyze only
untar(tarfile, files = SampleCELfiles, exdir = paste0(studyDir,'/', GEOId, '/CEL')) # untar CEL files of samples we want to work with into subdirectory /CEL
pathSampleCELfiles <- paste0(studyDir, '/', GEOId,'/CEL/', SampleCELfiles) # path of CEL files of samples we want to work with
rawData <- read.celfiles(pathSampleCELfiles) # read CEL files of samples we want to work with.
normData <- oligo::rma(rawData, background=TRUE, normalize=TRUE, target=core)

1
Entering edit mode

Ah, the issue is the array type, i.e., the U133. The target functionality only works for certain Affymetrix array designs, i.e., those that have a 'Gene' or 'Exon' in the name, and also usually have 'ST', reflecting the different probe design / layout.

So, you have to run rma() without target.

I can provide more information on the differences between these 2 broad classes of Affy array designs, if you wish.

If you wish to summarise over transcripts after you normalise with rma(), you can do something rudimentary like using aggregate() (base R) or avereps() (limma)

0
Entering edit mode

could you then provide that info please? so in this case, by doing rma() automatically summarizes by transcript instead of gene is that it?

1
Entering edit mode

The ST 'Exon' and 'Gene' arrays differ from the U133 and earlier arrays as follows:

• mismatch (MM) probes are mostly absent
• to accommodate more targets, feature size reduced from 121µm^2 to 25µm^2
• target entire length of gene, whereas U133 mainly targets 3 only
• manufacturer-supplied annotations differ

The absence of MM probes affects downstream processing and means, for example, that the affy Bioconductor package cannot process ST arrays. Also, some normalisation (e.g. mas5) and QC methods that use perfect match (PM) and MM probes cannot be used with these arrays.

The fact that the U133 arrays only target a certain part of the mRNA (third point above) additionally means that probe summarisation via target` is not required. You may still see duplicate gene entries after you translate the probe IDs to gene symbols, though.

0
Entering edit mode

even after translating probe IDs to ensembl/entrez gene ids..thank you

0
Entering edit mode

I saw in another post that one probe-set targets one exon. So, does it mean that two probe-sets may be targeting two different exons of same gene, and that's why we get even after summarizing a U133 array different probe set ids corresponding to same ensembl gene id? Or is there only one gene per probe-set and we get same ensembl gene id for two probe-sets cause of annotation problems?

1
Entering edit mode

I think that they may be targeting different transcript isoforms of the same gene, but cannot confirm. The UCSC genome browser actually stores Affymetrix probe IDs, so, you can paste in a probe ID to see what it targets. This may help to visualise better these probes that target the same gene. I have never done a comprehensive analysis / review of this, though... I'm not sure anybody has due to the fact that each array design differs from the next.

0
Entering edit mode

If they are different isoforms, doing an average of two might not be appropriate cause one might no be expressed and that brings gene expression down..

0
Entering edit mode

Yes, of course, in which case you will have to write some code to check for these situations in which the expression is so low for one probe such that it is negligible.

0
Entering edit mode

From GeneAnnot site we can see that different probe sets that target same gene have variable numbers of same probes. For example for gene MAPK1 there is: 1552264_a_at probe set has probes 1 to 11, while 224621_at probe set has probes 1 to 3 and 8 to 11. For same probe set, different combinations of probes identify different transcripts, so the same probe set targets different transcript isoforms.

0
Entering edit mode

That looks like a bit of a mess... but this, frequently, is how array designs appear when one looks deep into them.

The UCSC Genome Browser indicates that 224621_at targets the terminal exon, while the other (1552264_a_at) targets across the gene.