Question: Normalization Methods To Apply On A Data.Frame Object
Curious Mind wrote:

Hi,

There are several methods for normalizing data stored as AffyBatch objects, among them threestep, mas5calls, mascallsfilter, justMAS and rma.

However, my data is in data.frame format, because I read my expression data from a .txt file. Could you please tell me which normalization and filtering methods I can use on a data.frame? Or is it possible to convert a data.frame into an AffyBatch object?

When I tried some of the normalization methods, I got the following error:

> dat.eset <- threestep(dat.fp,background.method="RMA.2",normalize.method="quantile",summary.method="median.polish")
Error in threestep(dat, background.method = "RMA.2", normalize.method = "quantile",  : 
  argument is data.frame threestep requires AffyBatch

> dat.mas5 <- mas5calls(dat)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘mas5calls’ for signature ‘"data.frame"’

Thanks

Tags: R, bioconductor
modified 6.0 years ago by polarise • written 6.1 years ago by Curious Mind

Are you absolutely sure your data comes from an Affymetrix platform? Affymetrix files usually come in a binary .CEL format.

Secondly, if the file IS from an Affymetrix platform and it's a .txt file, there is a good chance that it has already been normalized, since the raw data would normally be in the binary .CEL format mentioned above.

Edit: If you're in doubt, post the first few header lines from your file. You could also make a density plot of the intensities, which will show whether the data has been normalized or not.

modified 6.1 years ago • written 6.1 years ago by David Westergaard
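
A minimal sketch of the density-plot check suggested above, using base R only. The data frame `dat` is simulated here as a stand-in for the real table (one column per sample); the sample names and values are made up for illustration.

```r
# Hypothetical stand-in for the real table: 1000 probesets x 4 samples,
# simulated on a log2-like scale (invented numbers, for illustration only).
set.seed(1)
dat <- as.data.frame(matrix(rnorm(4000, mean = 7, sd = 1.5), ncol = 4,
                            dimnames = list(NULL, paste0("sample", 1:4))))

# One kernel density estimate per sample (column).
dens <- lapply(dat, density, na.rm = TRUE)

# Common axis limits so every curve fits in the same panel.
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

plot(NULL, xlim = xlim, ylim = ylim,
     xlab = "Intensity", ylab = "Density", main = "Per-sample densities")
for (i in seq_along(dens)) lines(dens[[i]], col = i)
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
```

If the curves lie almost on top of each other, the samples have most likely been normalized against each other already; clearly shifted or differently shaped curves suggest they have not.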

This is from the GEO database. According to the authors, it is from an Affymetrix platform.

ID GSM801843 GSM801844 GSM801845 GSM801846 GSM801847 GSM801848 GSM801849 GSM801850 GSM801851 GSM801852 GSM801853 GSM801854 GSM801855 GSM801856 GSM801857 GSM801858 GSM801859 GSM801860 GSM801861
NM_014543.2_psr1_at 7.78415 7.63683 7.09851 7.41493 6.87848 7.22712 6.99564 5.86747 5.83964 6.61278 7.52737 7.97955 6.6788 7.50651 7.23592 5.37349 6.28702 6.46063 6.30963

written 6.1 years ago by Curious Mind

It looks to me like it's already normalized and in log2 values. Try a density plot: you should see that the distributions of all samples lie very close to each other.

written 6.1 years ago by David Westergaard
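
A rough numeric check of the "already in log2" impression, as a sketch. The helper `looks_log2` and its cutoff of 20 are assumptions made for this example, not part of any package; the idea is only that log2-scale array values stay small, while raw intensities typically run into the hundreds or thousands.

```r
# Heuristic only: `looks_log2` is a hypothetical helper, and the cutoff
# of 20 is an assumed threshold, not an established rule.
looks_log2 <- function(x, upper = 20) {
  v <- unlist(x, use.names = FALSE)
  v <- v[is.finite(v)]
  max(v) <= upper
}

# The row posted above for probeset NM_014543.2_psr1_at.
posted <- c(7.78415, 7.63683, 7.09851, 7.41493, 6.87848, 7.22712, 6.99564,
            5.86747, 5.83964, 6.61278, 7.52737, 7.97955, 6.6788, 7.50651,
            7.23592, 5.37349, 6.28702, 6.46063, 6.30963)

range(posted)           # 5.37349 7.97955
looks_log2(posted)      # TRUE
looks_log2(2^posted)    # FALSE: the back-transformed values exceed the cutoff
```

A single row cannot prove anything about the whole table; running the same check on the full data frame is more informative.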

Thank you so much. This explains why I saw so little variation.

modified 6.1 years ago • written 6.1 years ago by Curious Mind

FYI, the CEL files are available for those via GEO and you can process them however you like.

written 6.1 years ago by Devon Ryan

Just to nitpick: not all authors provide the raw data files (most do, though!). Some authors deposit only the normalized data files, which is quite annoying.

written 6.1 years ago by David Westergaard

I did say "...for those" :) Yeah, it's always annoying to have to deal with some randomly processed series matrix file!

written 6.1 years ago by Devon Ryan

How were the values in the text file processed? BTW, you can always just manually create an AffyBatch object, though it doesn't look completely trivial.

written 6.1 years ago by Devon Ryan
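
An AffyBatch holds probe-level intensities, which a table of processed probeset values does not contain. If a Bioconductor container is still wanted, a lighter alternative is an ExpressionSet. The sketch below assumes the Biobase package is installed and uses an invented toy matrix; it is one possible route, not the one anyone in this thread used.

```r
# Toy matrix standing in for the processed table (values are invented).
mat <- matrix(c(7.8, 6.1, 7.6, 5.9), nrow = 2,
              dimnames = list(c("probeset_a", "probeset_b"),
                              c("sample1", "sample2")))

if (requireNamespace("Biobase", quietly = TRUE)) {
  # Wrap the matrix in an ExpressionSet (rows = features, columns = samples).
  eset <- Biobase::ExpressionSet(assayData = mat)
  # exprs() gives the expression matrix back.
  print(dim(Biobase::exprs(eset)))
} else {
  message("Biobase is not installed; skipping the ExpressionSet step.")
}
```

Functions that insist on an AffyBatch (such as threestep) will still refuse an ExpressionSet, so this only helps with tools that accept ExpressionSet or matrix input.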

I am reading the text file as shown below:

dat <- read.table("C:/Data/EstrogenSampleData.txt", header=TRUE, row.names=1)

Thanks

written 6.1 years ago by Curious Mind
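
A self-contained sketch of reading such a table and turning it into a numeric matrix, which is what most matrix-based normalization code expects. The file contents below are invented and written to a temporary file; only the layout (an ID column followed by one column per sample) mirrors the posted data. On Windows, note that a path inside an R string needs forward slashes or doubled backslashes.

```r
# Write a tiny tab-delimited file so the example runs anywhere.
# The contents are invented; only the layout mirrors the posted table.
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tGSM1\tGSM2",
             "probeset_a\t7.8\t7.6",
             "probeset_b\t6.1\t5.9"), path)

# The first column (ID) becomes the row names.
dat <- read.table(path, header = TRUE, row.names = 1, sep = "\t")

# Convert the data.frame to a numeric matrix for downstream functions.
mat <- as.matrix(dat)
stopifnot(is.numeric(mat))
str(mat)
```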

My question isn't how the data was read into R, but rather how it was processed to create the text file. Are these processed intensities or are they pulled directly from the CEL files or what? The way to process them will depend on how you got the numbers you currently have.

written 6.1 years ago by Devon Ryan

I got it from the GEO database as a "Dataset SOFT file". The authors don't provide additional data. I want to test some publicly available datasets and understand the workflow. Thanks

written 6.1 years ago by Curious Mind

The data in the SOFT files has already been processed (there's no standard way in which this is done). BTW, you can use the GEOquery package and have it fetch the series matrix file for you. The workflow for that is a bit more straightforward.

written 6.1 years ago by Devon Ryan
library(GEOquery)
# get the ExpressionSet, usually normalized already
eset = getGEO("GSE32394")[[1]]
# get the .CEL files
getGEOSuppFiles("GSE32394")
modified 6.1 years ago • written 6.1 years ago by Sean Davis

Cross-posted: http://stackoverflow.com/q/18230978/1274516

written 6.1 years ago by Ben
Michael Dondrup (Bergen, Norway) wrote:

That is absolutely possible. At least the last time I checked, the normalization functions in affy were wrappers around internal functions that ultimately reduce to functions operating on matrices. It is possible to dig these internal functions out of the affy source code and use them directly, even though that might not be recommended. I did this once; if you want, I can try to find it for you.

written 6.1 years ago by Michael Dondrup
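
To illustrate the point that normalization ultimately comes down to operations on a matrix, here is a minimal base-R sketch of quantile normalization. It is not the affy internal code: the function name `quantile_normalize` is hypothetical, and ties are simply broken by order of appearance, which dedicated packages may handle differently.

```r
# `quantile_normalize` is a hypothetical name for this sketch. Ties are
# broken by order of appearance, a simplification.
quantile_normalize <- function(m) {
  stopifnot(is.matrix(m), is.numeric(m), !anyNA(m), nrow(m) > 1)
  # Sort each column, then average across columns at each rank.
  sorted <- apply(m, 2, sort)
  target <- rowMeans(sorted)
  # Replace every value by the target value at its within-column rank.
  out <- apply(m, 2, function(col) target[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(m)
  out
}

# Simulated samples with different means (invented data).
set.seed(42)
m <- cbind(a = rnorm(5, 7), b = rnorm(5, 9), c = rnorm(5, 6))
qn <- quantile_normalize(m)

# After normalization every column holds the same set of values.
apply(qn, 2, sort)
```

The within-column ordering of probesets is preserved; only the values are mapped onto a shared distribution.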
polarise (Galway, Ireland) wrote:

The limma package has several normalisation functions that work on common R data structures. normalizeBetweenArrays() accepts a plain numeric matrix and is the relevant one for single-channel data like this; normalizeWithinArrays() is meant for two-colour arrays. Quantile normalisation on a plain matrix is also available as normalize.quantiles(), which is exported by the preprocessCore package (in affy it does not seem to be directly accessible).

written 6.0 years ago by polarise
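
A short sketch of the limma route on a plain matrix, assuming the limma package is installed. The matrix here is simulated; with real data, convert the data.frame first with as.matrix(). Whether further normalization is appropriate at all depends on how the values were processed before deposition.

```r
# Simulated log2-scale values for three samples (invented data).
set.seed(7)
mat <- cbind(s1 = rnorm(100, 7), s2 = rnorm(100, 8), s3 = rnorm(100, 6.5))

if (requireNamespace("limma", quietly = TRUE)) {
  # Quantile normalization across samples (columns) of a numeric matrix.
  norm <- limma::normalizeBetweenArrays(mat, method = "quantile")
  # Column medians before and after, for a quick comparison.
  print(rbind(before = apply(mat, 2, median),
              after  = apply(norm, 2, median)))
} else {
  message("limma is not installed; skipping the normalization step.")
}
```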