Question: Human Exon array probeset to gene-level expression
2
gravatar for mforde84
3.0 years ago by
mforde841.2k
mforde841.2k wrote:

Hi,

I'm working with some Affy HuEx arrays.

I'm able to do all of the necessary backgrounding, normalization, and summarization steps for the arrays, however I'm a little stuck on what's a reasonable way to collapse transcript level probesets to gene-level estimates. Should I do something as simple as aggregate by mean / median? Or is there a more robust way to estimate general abundance of a gene from these arrays?

Marty

microarray affy • 1.7k views
ADD COMMENTlink modified 3.0 years ago by Kevin Blighe65k • written 3.0 years ago by mforde841.2k

So anyone have any ideas?

ADD REPLYlink written 3.0 years ago by mforde841.2k
2
gravatar for Kevin Blighe
3.0 years ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

Hey Marty,

I've analysed data from this chip recently (2016). Are you using the limma and oligo packages in R in order to process these?

When you get to the rma() function, you can specify there how you want to summarise the expression values:

Summarised per gene

rma(project, background=TRUE, normalize=TRUE, target="core")

If your array is an 'Exon' array, this may still only summarise to exon-level, requiring you to perform a further [manual] summarisation to gene-level. There are also other options available specifically for these 'Exon' arrays:

  • target = ’full'
  • target = ’extended’

Summarised per probe set:

rma(project, background=TRUE, normalize=TRUE, target="probeset")

Probe sets target different things, depending on the array type.

------------------------------------

NB - the behaviour of target will greatly depend on the array type!

Kevin

ADD COMMENTlink modified 15 months ago • written 3.0 years ago by Kevin Blighe65k
1

This is of great help. If target isn't provided RMA by default normalizes, summarizes based on gene-level. ?rma doesn't provide any information either on this. Thanks.

ADD REPLYlink written 2.5 years ago by Bioinformatics_NewComer320

Yes this is one of fundamental and critical things that is surprisingly not mentioned in many of the tutorials! I cannot remember how I figured it out, but it was a long time ago.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Kevin Blighe65k
1

Yes, I ran analyses with target as "core" and without anything assuming without target RMA will do probeset. I looked at the row counts for both results and everything was same. wasted many hours to make sure I did everything OK before stumbling onto this thread. :)

ADD REPLYlink written 2.5 years ago by Bioinformatics_NewComer320

thanks kevin. what would you suggest for collapsing multimapped probesets? aggregate by mean / median, or something different?

ADD REPLYlink written 3.0 years ago by mforde841.2k
1

I may have missed exactly what you mean by 'multimapped probesets'? During RMA normalisation, a Tukey's 'median polish' is applied using information from all probesets. Are you just referring to summarising transcript isoforms into a single expression value for each gene? - in the past, I have always used the mean in this case and results have been as expected.

ADD REPLYlink written 3.0 years ago by Kevin Blighe65k

my mistake, but yes just summarizing transcripts to gene level. i checked both the results for core and probeset summarization options. while im using SCAN.UPC instead, they are both appear to be generating estimates for transcripts. So to convert to entrez id, we simply select using key types matching in the transcriptcluster.db, then aggregate by entrez id to get gene level estimates.

thanks for the help.

ADD REPLYlink written 3.0 years ago by mforde841.2k

I am trying to summarize on gene level with target=core, but get this error: Error in .local(object, ...) : unused argument (target = "core"), even when specifying oligo::rma( Do you know why is this happens?

ADD REPLYlink modified 10 months ago • written 10 months ago by salamandra330

Which array are you using? are you specifying:

rma(..., target = core)

...or

rma(..., target = 'core')

???

ADD REPLYlink written 10 months ago by Kevin Blighe65k

I tried both: array is Affymetrix Human Genome U133 Plus 2.0 Array from this study: GSE28799 Code is:

library(oligo) 
library(GEOquery)
platform = 'HG-U133_Plus_2' # specific affymetrix microarray platform
dataDir <- 'D:/Tania_2/PhD/CSCs/data/' 
studyId <- 'study10' 
GEOId <- 'GSE28799'# GEO Id of the study/series
GPLId <- 'GPL570'


treatment1 <- 'CNSCs'
treatment2 <- 'CSCs'

treatment1Samples <- c('GSM713222', 'GSM713223', 'GSM713224') 
treatment2Samples <- c('GSM713227', 'GSM713228', 'GSM713229') 
sampleNames <- c(treatment1Samples, treatment2Samples) 
studyDir <- paste0(dataDir,studyId) # path of directory of specific study
dir.create(file.path(studyDir), suppressWarnings('already exists')) # creates a new directory for study
getGEOSuppFiles(GEOId, baseDir = studyDir) # get .tar file with study raw data (it's a supplementary file at GEO page)
tarfile <- paste0(studyDir, '/', GEOId,'/',GEOId,'_RAW.tar') # path to tarfile
FilesInside <- untar(tarfile, list = T) # list file names inside tarfile,have to set list = T to only list them
CELfiles <- grep('CEL', FilesInside, value = T) # select names of CEL files only. 
SampleCELfiles <- CELfiles[grepl(paste(sampleNames,collapse='|'), CELfiles)] # select names of CEL files of the samples we want to analyze only
untar(tarfile, files = SampleCELfiles, exdir = paste0(studyDir,'/', GEOId, '/CEL')) # untar CEL files of samples we want to work with into subdirectory /CEL
pathSampleCELfiles <- paste0(studyDir, '/', GEOId,'/CEL/', SampleCELfiles) # path of CEL files of samples we want to work with
rawData <- read.celfiles(pathSampleCELfiles) # read CEL files of samples we want to work with.
normData <- oligo::rma(rawData, background=TRUE, normalize=TRUE, target=core)
ADD REPLYlink modified 10 months ago • written 10 months ago by salamandra330
1

Ah, the issue is the array type, i.e., the U133. The target functionality only works for certain Affymetrix array designs, i.e., those that have a 'Gene' or 'Exon' in the name, and also usually have 'ST', reflecting the different probe design / layout.

So, you have to run rma() without target.

I can provide more information on the differences between these 2 broad classes of Affy array designs, if you wish.

If you wish to summarise over transcripts after you normalise with rma(), you can do something rudimentary like using aggregate() (base R) or avereps() (limma)

ADD REPLYlink written 10 months ago by Kevin Blighe65k

could you then provide that info please? so in this case, by doing rma() automatically summarizes by transcript instead of gene is that it?

ADD REPLYlink written 10 months ago by salamandra330
1

The ST 'Exon' and 'Gene' arrays differ from the U133 and earlier arrays as follows:

  • mismatch (MM) probes are mostly absent
  • to accommodate more targets, feature size reduced from 121µm^2 to 25µm^2
  • target entire length of gene, whereas U133 mainly targets 3` only
  • manufacturer-supplied annotations differ

The absence of MM probes affects downstream processing and means, for example, that the affy Bioconductor package cannot process ST arrays. Also, some normalisation (e.g. mas5) and QC methods that use perfect match (PM) and MM probes cannot be used with these arrays.

The fact that the U133 arrays only target a certain part of the mRNA (third point above) additionally means that probe summarisation via target is not required. You may still see duplicate gene entries after you translate the probe IDs to gene symbols, though.

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe65k

even after translating probe IDs to ensembl/entrez gene ids..thank you

ADD REPLYlink modified 10 months ago • written 10 months ago by salamandra330

I saw in another post that one probe-set targets one exon. So, does it mean that two probe-sets may be targeting two different exons of same gene, and that's why we get even after summarizing a U133 array different probe set ids corresponding to same ensembl gene id? Or is there only one gene per probe-set and we get same ensembl gene id for two probe-sets cause of annotation problems?

ADD REPLYlink written 10 months ago by salamandra330
1

I think that they may be targeting different transcript isoforms of the same gene, but cannot confirm. The UCSC genome browser actually stores Affymetrix probe IDs, so, you can paste in a probe ID to see what it targets. This may help to visualise better these probes that target the same gene. I have never done a comprehensive analysis / review of this, though... I'm not sure anybody has due to the fact that each array design differs from the next.

ADD REPLYlink written 10 months ago by Kevin Blighe65k

If they are different isoforms, doing an average of two might not be appropriate cause one might no be expressed and that brings gene expression down..

ADD REPLYlink written 10 months ago by salamandra330

Yes, of course, in which case you will have to write some code to check for these situations in which the expression is so low for one probe such that it is negligible.

ADD REPLYlink written 10 months ago by Kevin Blighe65k

From GeneAnnot site we can see that different probe sets that target same gene have variable numbers of same probes. For example for gene MAPK1 there is: 1552264_a_at probe set has probes 1 to 11, while 224621_at probe set has probes 1 to 3 and 8 to 11. For same probe set, different combinations of probes identify different transcripts, so the same probe set targets different transcript isoforms.

ADD REPLYlink written 10 months ago by salamandra330

That looks like a bit of a mess... but this, frequently, is how array designs appear when one looks deep into them.

The UCSC Genome Browser indicates that 224621_at targets the terminal exon, while the other (1552264_a_at) targets across the gene.

ADD REPLYlink written 10 months ago by Kevin Blighe65k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 653 users visited in the last hour