Question: Multiple probe for same gene
gravatar for Natasha
2.4 years ago by
Natasha40 wrote:

I'm looking at the gene expression value of the gene PFKL from GSE20966. The following is the data from one sample reported in the study PFKL PFKL PFKL PFKL GSM524151 58.27 17.71 402.40 439.61

As we see, the magnitude of the expression values are quite different. Also, the probe id associated with value reported in each column is different.

In the paper published from this study, a single value is reported for the gene,PFKL.

I'm confused on this aspect. Do we consider the probe set that shows high expression? Or should we compute an average?

Could someone explain?

bioconductor geo gene • 2.0k views
ADD COMMENTlink modified 2.4 years ago by Kevin Blighe69k • written 2.4 years ago by Natasha40

In my opinion the multiple probes could be targeting exon that contribute to different transcripts isoforms. The final value of the gene they mention is it average of all the probe values or the highest among them? In case you don't care about transcript isoforms, you could follow their logic of finding gene expression for all genes for comparison. If you want to know the expression of transcripts, you could dig the probe platform they use, and they might have information what probe refers to what transcripts.

ADD REPLYlink written 2.4 years ago by piyushjo550

Could you please explain a bit more on what you mean by transcript isoform? PFK has 3 isoforms , PFKL ,PFKM,PFKP. THe example that I mentioned above is PFKL. There were four probe ids just for PFKL . This leads to the confusion. As you pointed out, I agree different transcripts are linked to different probes. In the data file, I clearly see the distinction i.e. PFKM and PFKP are reported separately.

ADD REPLYlink written 2.4 years ago by Natasha40

Microarray: How To Select One Of Multiple Probes Corresponding To A Gene Microarray Expression For Genes With Multiple Probes

Please refer to the following post to gain more information. I think different people might suggest different way, all equally correct/incorrect. Just be consistent, that should be the rule.

ADD REPLYlink written 2.4 years ago by piyushjo550

In response to your previous post,

I could find 4 values from the same trail, For GSM524151,

58.27 , probe id    201102_3p_s_at
17.71 , probe id    211065_3p_x_at
402.40  , probe id g13623608_3p_s_at
439.61 , probe id   g4505746_3p_s_at

Is there any discrepancy in what I observe? Could you please let me know?

I'll definitely read through the posts in the link that you just shared

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Natasha40

It depends what is you end goal? Comparing gene expression among various condition of all the genes or just the gene of your focus? If you want to focus on all genes, just follow the procedure to merge different probe information into one. You could chose media, mean or just the highest. Like I said follow the same rule. If you want to focus on one transcript. Focus on the transcript isoform that you are interested in (as shown below). While different isoform have different transcripts, they could end up coding the same protein or same functional protein. So, the difference in isoforms could be just the utr, which is related to regulation or could be difference in domain that don't have a evolutionary conserved domain like homeobox, zinc finger and so on. Just to summarize, it depends on the end goal.

ADD REPLYlink written 2.4 years ago by piyushjo550

Thanks a lot for the advise. My end goal is to compare the expression level of different isoforms(not at the transcript level) of all the genes. I will stick to the suggestions received and follow this link , consider the highest . There were a few answers, in the posts of the link shared by you, that also suggested to consider the second highest.

ADD REPLYlink written 2.4 years ago by Natasha40
gravatar for Kevin Blighe
2.4 years ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

Hey Natasha. As mentioned, this could be related to different transcript splice isoforms. PFKL is a well-studied gene and multiple splice isoforms have been identified. If the first 2 probes target the exons at the 5' end of the gene, then they will not be expressed in certain tissues due to the fact that PFKL isoforms are expressed at different levels across different tissues.

You can see the splice isoforms annotated by RefSeq here:


The source of tissue used in your study is, thus, important to know here. You can look up the expression of genes at GTEx.


Other possibilities for the finding include:

  1. Badly designed probes. Many of these probes are designed by algorithms and are refined over time (and released / updated on new platform releases) as new knowledge is added to the field
  2. some regulatory mechanism may be reducing the expression of the sequences targeted by those probes, which is something specific to your disease area. In this case, you could compare the expression of these probes in cases and controls separately.


ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Kevin Blighe69k

Kevin, Thanks a lot for the advise. I had a chance to read through some of the other answers on this same question.Some of the answers were posted six years back.

I'm curious to know if there a programmatic way of doing these checks .In an answer given here ,Robert Gentleman has mentioned about automatic this process around 14 years back.

ADD REPLYlink written 2.4 years ago by Natasha40

Robert Gentlemen is something of a legend in the realm of R and Bioconductor. I have never met him, personally, though. Also, when he wrote that, I was not even calling myself a bioinformatician and could probably not even describe what a 'probe-set' really was.

A possible way to circumvent the issue comes through the normalisation process whereby one can 'summarise' expression of probe-sets over genes or exons. In microarrays, a single 'probe-set' may target a single exon of a gene, whereas multiple probe-sets, then, cover various exons in the gene. In microarray analysis realms, we refer to this summarisation [of expression] as 'polishing', for example, a median polish.

The way that you control this type of summarisation is, in part, as I describe here: C: Human Exon array probeset to gene-level expression

I am not aware of any other methods that automatically go through each gene and look at the differential expression of each exon. It could certainly be programmed, though.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Kevin Blighe69k

Kevin, Probably, I'm in the same position now, a beginner asking a lot of silly questions. But thanks to everyone on this forum for being so supportive. I wouldn't have reached this far otherwise.

I wish to ask for a few clarifications on the syntax given in the above link in the context of the following example. Example, the gene PFK exists in three isoforms:PFKL,PFKM,PFKP. From what you have explained above, if my understanding is correct, a number of transcript isoforms are liked to each of the isoform PFKL,PFKM,PFKP.

My goal is to fine a value for the gene,say PFKL(I'm not looking for the transcript isoform information at this point of my reserach).

Summarised per gene:

rma(project, background=TRUE, normalize=TRUE, target="core")

In the above, does gene refer to an isoform say, PFKL or PFK itself?

I hope probe set(e.g.an201102_3p_s_at) refers to a collection of probes that show affinity to bind different fragment of a gene of interest. Does,

rma(project, background=TRUE, normalize=TRUE, target="probeset")

give an outcome like the one below, for each probe set

Expression value(unlogged) 58.27 , probeset id    201102_3p_s_at.

Additionally, I'm was suggested to use gcrma package for normalizing data from affy. I used the above commands for processing the RAW files of GSE20966.

eset1 <- gcrma(project, background=TRUE, normalize=TRUE, target="probeset")
eset2 <- gcrma(project, background=TRUE, normalize=TRUE, target="core")

exprs(eset1) and exprs(eset2) returns the same output. I'm a little confused about the right way to use these commands.

Thanks a ton

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Natasha40

Hey, oh, from my understanding on these PFK genes, they are part of a gene family whose products form a tetramer protein. So, they will be regarded as entirely independent entities on a microarray.

If you do target="core", you should obtain a single value for each of these genes, but this will greatly depend on the design of the array that you are using.

If you do target="probeset". you should see multiple values for each of these genes, each value representing an exon that was targeted in each gene. The array won't actually return any definitive information on the different splice isoforms for each gene.

NB - the behaviour of target will greatly depend on the array type!

Also, when you normalise, are your data-points coming back with names as the original probes? If you need help annotating these, you could take a look here:

ADD REPLYlink modified 19 months ago • written 2.4 years ago by Kevin Blighe69k

Thanks a lot for the explanation. Could you please let me know whether it's possible to get the phenoData from the expression set created using CEL files? I tried pData(eset), but it didn't help. Should it be created manually, like the example shown here?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Natasha40

The phenotype data (pData or phenoData), or 'metadata', may or may not come with the expression data. In the page to which you linked, the pData is created manually and then added to the expression data in Section 2.1.

Note that microarray experiment data is usually stored in a data structure that is called an ExpressionSet. An expresion set contains 'slots' for the expression data (accessed via exprs()), the phenotype data (accessed via pData()), and other values.

When you download the Series Matrix data from the GEO, it's stored in such an ExpressionSet object.

In reality, analyses using limma can be conducted entirely independently of an ExpressionSet object using just a data-matrix of expression values and a data-frame of phenotype data.

I'm aware that microarray lingo can be confusing. There is a lot of incomplete and confusing documentation online, too, which appears common in bioinformatics.

ADD REPLYlink written 2.4 years ago by Kevin Blighe69k

This clarifies many of my confusions. Thank you very much.

Based on the suggestions that was given here, I'm trying to normalize the data from 2 experiments.

downloadedAffyFiles <- list.files(path = "../GSE_RAW/", pattern = "CEL.gz$",full.names=TRUE)#53454
AffyData <- ReadAffy(filenames = downloadedAffyFiles)
eset <- gcrma(AffyData,normalize=TRUE,fast=TRUE)

I have clubbed the CEL file of both the studies in GSE_RAW folder and I am performing a quantile normalization. If I am not wrong,from what I understand from user manual, the normalization step doesn't require a design matrix.

You mentioned, "It is possible to have multiple conditions, though, for example, Controls, Condition1, Condition2, Condition3 - this then becomes a multinomial regression problem, which limma can tolerate"

Could you please suggest whether it is required to have multiple endpoints for performing normalization? I'm sorry, I couldn't exactly understand which step would require multinomial regression analysis.

Thanks a lot for your time and attention

ADD REPLYlink modified 2.4 years ago by Kevin Blighe69k • written 2.4 years ago by Natasha40

Hey, yes, that is correct: the normalisation step does not require a design matrix / formula. Normalisation will just perform the background correction, quantile normalisation, and log2 transformation. The exact form in which the quantile normalisation proceeds will be affected by the very samples that are being normalised together based on how this method functions. That is, the goal [of normalisation] is to render the samples comparable via statistical tests, which means that each sample going into the process has some influence over how the normalisation is conducted. Other normalisation methods, e.g., RPKM and FPKM for RNA-seq, normalise each sample independently, which renders these incomparable via statistical tests (yet people still do compare them).

When you are producing your design and contracts matrices for limma is when you need to properly encode your multiple factor levels.

It could be something, like this, for example:

metadata <- data.frame(
                       condition=c(rep("typeII",4), rep("typeI",4), rep("dementia",2)),
                       experiment=c(rep("exp1", 5), rep("exp2", 5)))
   condition experiment
1     typeII       exp1
2     typeII       exp1
3     typeII       exp1
4     typeII       exp1
5      typeI       exp1
6      typeI       exp2
7      typeI       exp2
8      typeI       exp2
9   dementia       exp2
10  dementia       exp2

group <- paste0(metadata$condition, metadata$experiment)
design <- model.matrix(~ 0 + group)
   groupdementiaexp2 grouptypeIexp1 grouptypeIexp2 grouptypeIIexp1
1                  0              0              0               1
2                  0              0              0               1
3                  0              0              0               1
4                  0              0              0               1
5                  0              1              0               0
6                  0              0              1               0
7                  0              0              1               0
8                  0              0              1               0
9                  1              0              0               0
10                 1              0              0               0
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Kevin Blighe69k

Thanks a lot for clarifying all the naive doubts that I ask.

I'm facing challenge in interpreting the row indexes of metadata.

The numbers in the index running from 1 to 10 , do these represent the sample id's? Is it something like?

              condition   experiment
GSM1     typeII       exp1
GSM2     typeII       exp1
GSM3     typeII       exp1
GSM4     typeII       exp1
GSM5      typeI       exp1
GSM11     typeI       exp2
GSM12     typeI       exp2
GSM13     typeI       exp2

GSM1 to GSM5 are 5 samples from experiment 1 ; GSM11 to GSM15 are 3 samples from experiment 2. For instance , in the real case I have 90 samples from experiment1 and 200 samples from experiment 2. In user manual,the row index is sometimes mentioned to be array 1 to array n.I'm not really able to interpret whether "array" and "sample" mean the same.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Natasha40

Yes, they (most likely) represent the same thing, because a single sample is typically applied to a single array [micro] 'chip'. One thing that you must always ensure is that the order of samples in your metadata (rows) is the same as the order of samples in your expression data (columns). You should implement simple checks like this at various points in your code.

ADD REPLYlink written 2.4 years ago by Kevin Blighe69k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1456 users visited in the last hour