Where To Get CDF Files Needed To Analyze Data Obtained Via The Affymetrix Platform
11.1 years ago

Hello everyone, I am trying to normalize and analyze some microarray data in MATLAB as a learning exercise. I chose the data from the experiment linked below. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4069

I got the .cel and .chp files for that experiment from GEO, but I could not get the CDF file (Affymetrix Human Gene 1.0 ST Array) from GEO. I googled and found a CDF file with the same name, but it is revision 3. My question is: how do I look for a specific CDF file in this type of situation, and how can I be sure that the CDF file I am using is the correct one?

Thank you.

11.1 years ago
k.nirmalraman ★ 1.1k

Some of the unsupported CDF files can be found at the following link. Also, since you already know that the array is the Affymetrix Human Gene 1.0 ST Array, finding the CDF file is straightforward. You can download the CDF here.

You might have to create an account on the Affymetrix website.

Hey k.nirmalraman. Actually, the link you provided is for the Human Exon 1.0 ST Array. I think the original poster wants this one, for the Human Gene 1.0 ST Array. Strangely, a Google search returns the wrong array as the top result, which I suspect is what happened to you. For reference, you can get to this file through GEO as well: from the GEO dataset, look at 'Sample Subsets' and choose one of the samples, click on the Platform ID, then follow the provided 'Web link' to Affymetrix's site. You will need to register to download it. It looks like this is the same r3 CDF that the original poster obtained from aroma, though. Either should work. Possibly this is a MATLAB issue?

When we open a CEL file in MATLAB, it creates a structure with a ChipType field, which contains the name of the CDF file as a string. So when we give MATLAB the actual CDF file to open together with the CEL file, the CDF file's name should match the ChipType string (that's what I think). When I removed ",r3" from the end of the name of the unsupported CDF file I got from aroma-project, MATLAB didn't show any warning. But that's not the main issue. What I want to know is: can I use this unsupported file? Is it the same file, and how can I be sure? Here is an image of the MATLAB CEL file structure: http://i46.tinypic.com/2me8y8l.jpg

Hi, I think your work with Affymetrix data will be easier if you understand the difference between 'chip type' and 'chip definition file (CDF)', cf. http://aroma-project.org/definitions/chipTypesAndCDFs

Unless your software truly prevents you, it is best to avoid renaming your CDFs. For example, what if you have two different versions of the CDF for the same chip type and you are forced to rename them the way you suggest? How would you be able to distinguish them afterward?

The HuGene-1_0-st-v1,r3.cdf provided via the Aroma Project is a one-to-one binary version of the ASCII-version that Affymetrix provides. What Obi says about the term "unsupported" is correct.
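
To make the chip type versus CDF distinction concrete, here is a minimal R sketch. It assumes the naming convention seen in the file above: everything before the first comma in the file name is the chip type, and whatever follows is a tag such as r3. The helper name parse_cdf_name and the cel_chip_type value are illustrative assumptions, not part of any package.

```r
# Hypothetical helper: split a CDF file name of the form
# "<chip type>[,<tag>...].cdf" into its chip type and its tags.
parse_cdf_name <- function(path) {
  # Drop the directory and the .cdf extension
  name <- sub("\\.cdf$", "", basename(path), ignore.case = TRUE)
  # The first comma-separated part is the chip type; the rest are tags
  parts <- strsplit(name, ",", fixed = TRUE)[[1]]
  list(chip_type = parts[1], tags = parts[-1])
}

cdf <- parse_cdf_name("HuGene-1_0-st-v1,r3.cdf")

# Assumed value of the ChipType string stored in the CEL file
cel_chip_type <- "HuGene-1_0-st-v1"

# Compare chip types without renaming the file on disk
identical(cdf$chip_type, cel_chip_type)
```

This way the file keeps its ",r3" tag on disk, and only the chip type part is compared with the string stored in the CEL file.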

I imagine it is the same file, and yes, I think it would be reasonable to use it. I think Affymetrix only calls it unsupported because they would now prefer (and support) the use of the PLIER-compatible files with their own software. If it were me, I would do all this in R/Bioconductor. I will give you an example workflow.

Thanks for the reply. Actually, I got one CDF file from here: http://www.aroma-project.org/chipTypes/HuGene-1_0-st-v1 but it is revision #3 (HuGene-1_0-st-v1,r3.cdf), or at least that's what I understand from its name. Maybe it's the same file, or maybe it's not the exact file that the experiment was done with. When I opened it in MATLAB, MATLAB warned that the CDF file name given in the CEL file and this CDF file's name are not the same. I then edited its name to remove ",r3" and everything went perfectly; I got as far as gene filtering. I checked the Affymetrix website but couldn't find any CDF file named HuGene-1_0-st-v1.cdf there. One more thing: what is this Affymetrix BED file?

11.1 years ago

This is how I would process that particular dataset in R/Bioconductor. It assumes you have the latest version of R installed, and you will have to change the working directories.

#install the Bioconductor installer (BiocManager replaces the retired biocLite script), if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

# install the required bioconductor libraries, if not already installed
BiocManager::install(c("GEOquery", "affy", "gcrma",
                       "hugene10stv1cdf", "hugene10stv1probe",
                       "hugene10stprobeset.db", "hugene10sttranscriptcluster.db"))

#Load the necessary libraries
library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)

#Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")

#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")

#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor

#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)

#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)

#Format values to 5 significant digits (this converts the matrix to character strings)
rma=format(rma, digits=5)

#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)

#Extract probe ids, gene symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))

#Combine gene annotations with the normalized data
rma=cbind(probes,Symbols,Entrez_IDs,rma)

#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.

probes   Symbols   Entrez_IDs  GSM678364_B2.CEL  GSM678365_B4.CEL  GSM678366_B5.CEL  ...
7897441  H6PD      9563        6.5943            7.0552            7.5201            ...
7897449  SPSB1     80176       6.9727            7.0281            7.2285            ...
7897460  SLC25A33  84275       7.6659            7.4289            7.9707            ...

Also note, as a sanity check, I searched for a few of these probeset IDs in BioMart and confirmed the same probeset-EntrezID-symbol mappings. I was surprised by the huge number of probesets which did not map to a gene symbol or ID. But, I'm not that familiar with this chip so it might be expected.
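
One quick way to see how many probesets lack annotation is to count the NA entries in the Symbols column of the table written above. Here is a minimal R sketch. It uses the three rows shown earlier plus one made-up unmapped row (probe ID 9999999 is hypothetical) so that it runs on its own; with the real output you would read rma.txt with read.delim instead.

```r
# Three rows copied from the example output, plus one hypothetical
# unmapped row (ID 9999999 is made up for illustration only).
rows <- c(
  "probes\tSymbols\tEntrez_IDs\tGSM678364_B2.CEL",
  "7897441\tH6PD\t9563\t6.5943",
  "7897449\tSPSB1\t80176\t6.9727",
  "7897460\tSLC25A33\t84275\t7.6659",
  "9999999\tNA\tNA\t5.0000"
)

# Parse the tab-delimited text; the string "NA" is read as a missing value
tab <- read.delim(text = paste(rows, collapse = "\n"), stringsAsFactors = FALSE)

# Flag probesets without a gene symbol and count them
unmapped <- is.na(tab$Symbols)
sum(unmapped)

# Keep only the annotated probesets for gene-level work
annotated <- tab[!unmapped, ]
```

Dropping the unmapped rows is only one option; whether to keep them depends on whether the downstream analysis needs gene-level identifiers.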

Thanks for your answer, this really helped. This semester we have to study MATLAB, but I will definitely use R and Bioconductor next time.
