Question: How to filter out probesets and genes from an Affymetrix microarray expression data matrix?
4 weeks ago by
Chicago, IL, USA
Jurat Shahidin wrote:

Hi

I am new to microarray data and am currently using Affymetrix microarray expression data for my experiments. Essentially, I have an Affymetrix expression data matrix with probesets in rows (32830 probesets) and RNA samples in columns (735 samples). I also have pheno data containing the metadata for this expression matrix (735 sample identifiers in rows and 6 descriptive fields in columns).

initial attempt

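# Load the RMA-preprocessed data (provides the object eset_HTA20)
load("data/HTA20_RMA.RData")
# Relative log expression (RLE): subtract each probeset's median across samples
row_medArray <- Biobase::rowMedians(eset_HTA20)
RLE_data <- base::sweep(eset_HTA20, 1, row_medArray)
RLE_data <- base::as.data.frame(RLE_data)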

my question:

I find the limma case study helpful but incomplete for my purposes. Basically, I am going to try the following steps:

  1. how to summarize an Affymetrix microarray expression matrix at the gene level?
  2. how to list probeset intensities per gene?
  3. how to filter out probesets and genes per sample?
  4. how to add gene-level annotation to an Affymetrix expression set?

How can I carry out the steps above? Can anyone point out how to lay out this workflow in R? Any ideas? Thanks

modified 4 weeks ago by Kevin Blighe • written 4 weeks ago by Jurat Shahidin

Very similar questions cross posted to Bioconductor:

modified 4 weeks ago • written 4 weeks ago by Gordon Smyth

I should have avoided cross-posting; thanks for the reminder. Do I need to remove my post, or just keep this in mind next time?

written 4 weeks ago by Jurat Shahidin

IMO it is best not to remove the cross-posts at this stage because they've already been answered. Removing the duplicate questions would also remove people's answers.

modified 4 weeks ago • written 4 weeks ago by Gordon Smyth

Thanks for your community effort; I will keep this in mind for my future posts.

written 4 weeks ago by Jurat Shahidin
4 weeks ago by
Kevin Blighe wrote:

how to make summarization of Affymetrix microarray expression matrix at gene level?

For summarisation, take a look at the target parameter that is passed to rma() in the oligo package. Two commonly used values are:

  • "core", will summarise to gene-level expression
  • "probeset", will summarise to probe-sets, usually meaning exons

The exact functioning of these parameters will depend on the microarray design - there are many layouts of probes and probe-sets. Your question seems generic for Affymetrix arrays (?)
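As a rough sketch of how target might be used: the snippet below assumes that the oligo package and the platform design package for your array are installed, and that the raw CEL files sit in a hypothetical directory data/CEL (neither the path nor the object names come from the original post). It only runs the summarisation if CEL files are actually found.

```r
# Hypothetical location of the raw CEL files; adjust to your own data.
cel_dir <- "data/CEL"
cel_files <- list.files(cel_dir, pattern = "\\.CEL(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)

if (length(cel_files) > 0 && requireNamespace("oligo", quietly = TRUE)) {
  # Read the raw intensities into a FeatureSet object
  raw <- oligo::read.celfiles(cel_files)

  # Gene-level summarisation
  eset_gene <- oligo::rma(raw, target = "core")

  # Probe-set-level summarisation
  eset_probeset <- oligo::rma(raw, target = "probeset")

  # rma() returns an ExpressionSet; exprs() extracts the log2 matrix
  gene_matrix <- Biobase::exprs(eset_gene)
}
```

Note that rma() is applied to the raw object returned by read.celfiles(), not to an expression matrix that has already been summarised.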

how to list out probesets intensities per gene?

See my comment above regarding the use of target.

how to filter out probesets and genes per sample?

You can filter before or after normalisation. Some people try to detect 'dud' probes (probes that failed) prior to normalisation and filter these out; most people (in my experience), however, just include all probes / probe-sets. Obviously, for the process of normalisation, you need to include the control probes that are used for background correction, et cetera. On Affymetrix platforms, control probe-sets usually begin with 'AFFX' - see here:

They may have other prefixes depending on the array type that you are using. For a complete picture, download the documentation from the Affymetrix / ThermoFisher website for the exact array type that you are using.

Another package is genefilter, which some use (I do not): https://bioconductor.org/packages/release/bioc/vignettes/genefilter/inst/doc/howtogenefilter.pdf
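A minimal base R sketch of post-normalisation filtering, under stated assumptions: the toy matrix, the probe-set IDs, and the "keep the more variable half" cut-off are all made up for illustration, and the prefix test assumes the control probe-sets on your array really do start with 'AFFX'. genefilter offers its own functions for this; they are not used here.

```r
# Toy log2 expression matrix: probe-sets in rows, samples in columns.
set.seed(1)
expr <- matrix(rnorm(6 * 4, mean = 8), nrow = 6,
               dimnames = list(c("AFFX-BioB-5_at", "AFFX-r2-Ec-bioC-3_at",
                                 "ps_0001", "ps_0002", "ps_0003", "ps_0004"),
                               paste0("sample", 1:4)))

# Drop control probe-sets, identified here by the 'AFFX' prefix.
is_control <- grepl("^AFFX", rownames(expr), ignore.case = TRUE)
expr_noctrl <- expr[!is_control, , drop = FALSE]

# Optional non-specific filter: keep probe-sets whose standard deviation
# across samples is at or above the median (an arbitrary cut-off).
row_sd <- apply(expr_noctrl, 1, sd)
keep <- row_sd >= median(row_sd)
expr_filtered <- expr_noctrl[keep, , drop = FALSE]
```

The same row-wise pattern works with other statistics (IQR, mean intensity) by swapping the function passed to apply().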

how to add gene-level annotation to Affymetrix expression set?

There are many options, one being biomaRt; see the answer "A: Affymetrix Human Genome U133 Plus 2.0 Array".

There are also usually array-specific annotation packages in Bioconductor, at least for the commonly used arrays, whose annotation you can simply add to your ExpressionSet object. See here: https://support.bioconductor.org/p/63834/
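However the annotation is obtained, attaching it comes down to aligning a lookup table to the row order of the expression matrix. The sketch below uses base R only; the probe-set IDs, gene symbols, and the annot table are hypothetical stand-ins for whatever biomaRt or an annotation package would return.

```r
# Toy expression matrix with hypothetical probe-set IDs.
expr <- matrix(c(7.1, 8.3, 5.2, 6.9, 7.4, 8.0), nrow = 3,
               dimnames = list(c("ps_0001", "ps_0002", "ps_0003"),
                               c("sample1", "sample2")))

# Hypothetical lookup table (probe-set ID -> gene symbol), deliberately in a
# different order and with one unannotated probe-set.
annot <- data.frame(probeset = c("ps_0003", "ps_0001", "ps_0002"),
                    symbol   = c("GENE_C", "GENE_A", NA),
                    stringsAsFactors = FALSE)

# Align the annotation rows to the row order of the expression matrix.
idx <- match(rownames(expr), annot$probeset)
feature_data <- data.frame(probeset = rownames(expr),
                           symbol   = annot$symbol[idx],
                           row.names = rownames(expr),
                           stringsAsFactors = FALSE)
```

With an ExpressionSet, a data frame aligned like this can be stored as feature data (for example via Biobase's fData<-), so the annotation travels with the expression values.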

Kevin

modified 4 weeks ago • written 4 weeks ago by Kevin Blighe

Dear Kevin:

I just saved myself from a wrong understanding of what I am doing. Essentially, I am still trying to generate a density plot for each gene, to see how well the preprocessed expression data have been normalized. I think I can try the limma case study for now.

However, I am interested in quantifying the Affymetrix expression data per gene with a statistical measure, the coefficient of variation, in order to discard the genes that show little change in expression. Could you point out how to lay this out in R?

modified 4 weeks ago • written 4 weeks ago by Jurat Shahidin

Would you mind taking a look at the data source that I am experimenting with?

Hey Jurat, sorry, we are 'just' a group of volunteers here.

...but, anyway, why are you using correlation to find the top 10 most expressed genes? Just normalise your raw data (CEL files) in the standard way using RMA, and then transform to Z-scale. Then, the gene(s) with the highest Z-scores can be assumed to be the highest expressed.

written 4 weeks ago by Kevin Blighe

Hi Kevin:

In my case, the raw CEL files are pretty big (around 40 GB), so I am just using the preprocessed Affymetrix gene expression data, which I can't transform myself. What can I do with a preprocessed Affymetrix gene expression data matrix? I followed the limma user guide, but some steps are not applicable. Could you point out how to correct my approach? Thank you.

When I tried to summarize the expression data in this way, I got the following error:

> oligo::rma(eset_HTA20, target = "core", normalize = FALSE)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘rma’ for signature ‘"matrix"’

Why did this happen?

modified 4 weeks ago • written 4 weeks ago by Jurat Shahidin

Which file did you obtain? The RData file? That will contain the normalised, log2 expression data. You can use that for limma and anything else downstream. You can also transform that to Z-scores.
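A minimal sketch of the Z-score transformation, assuming that per-gene scaling across samples is what is wanted and that the data are a plain numeric matrix with genes in rows (the toy matrix below is simulated, not the poster's data):

```r
# Toy normalised log2 matrix: genes in rows, samples in columns.
set.seed(1)
expr <- matrix(rnorm(5 * 6, mean = 8, sd = 2), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:6)))

# scale() standardises columns, so transpose to scale each gene (row),
# then transpose back to the original orientation.
z <- t(scale(t(expr)))
```

After this step each row of z has mean 0 and standard deviation 1 across samples.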

written 4 weeks ago by Kevin Blighe

Hi Kevin:

Thanks for your response. Yes, I used the RData file and tried to summarize the expression data at the gene level and filter out probesets and genes, but I didn't get it right. I want to filter the genes because I intend to run PCA on this expression data. I tried prcomp() for the PCA, but it is not giving me satisfying results. How can I make this work? Thanks again for your help, much appreciated.

written 4 weeks ago by Jurat Shahidin

So, in this RData file, are the expression values per gene or per exon? If they are per exon, you can summarise (by mean or median) to gene-level via the aggregate() function.
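The aggregate() approach could look roughly like this. It is a sketch with a made-up matrix and a hypothetical probe-set-to-gene mapping (the gene vector), which in practice would come from the array annotation:

```r
# Toy matrix: four probe-sets (rows) measured in two samples (columns).
expr <- matrix(c(6.0, 8.0, 5.0, 7.0,
                 6.4, 8.2, 5.5, 7.1), nrow = 4,
               dimnames = list(c("ps_1", "ps_2", "ps_3", "ps_4"),
                               c("sample1", "sample2")))

# Hypothetical mapping: one gene label per row of expr, in the same order.
gene <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B")

# Average all probe-sets that belong to the same gene, sample by sample.
agg <- aggregate(as.data.frame(expr), by = list(gene = gene), FUN = mean)

# Convert back to a numeric matrix with genes as row names.
gene_expr <- as.matrix(agg[, -1, drop = FALSE])
rownames(gene_expr) <- agg$gene
```

Passing FUN = median instead gives the median-based summary.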

Can you define what you regard as 'satisfying' results by PCA?

written 4 weeks ago by Kevin Blighe