I downloaded the CGP cell line project expression data and would like to convert the Affy probe IDs to official gene symbols. It's the HG U133A v2 platform and the dataset has around 22000 probes in total. What's the best way to do this? I tried using IDconverter, but it froze after around 100 genes. When I used DAVID to convert to official gene symbols, the results only had about 9800 genes. Using DAVID to convert to Entrez IDs returned about 24000 IDs, because multiple Entrez gene IDs were returned for some probes. How should I deal with these duplicated Entrez IDs, or is there a better way to do the conversion altogether? Thanks!
In R, for example, to convert the Affy IDs "1368587_at" and "1385248_a_at" (rat2302 chip) to their gene IDs, I would use the following:
library("annotate")
library("rat2302.db")  # use your own chip package here, e.g. hgu133a.db
select(rat2302.db, keys = c("1368587_at", "1385248_a_at"), columns = c("SYMBOL", "ENTREZID", "GENENAME"), keytype = "PROBEID")
For all probes, create a vector of probes and then use select:
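PROBES <- as.character(FCMATRIX$probe)  # assumes FCMATRIX is a data frame with a "probe" column
OUT <- select(rat2302.db, keys = PROBES, columns = c("SYMBOL", "ENTREZID", "GENENAME"), keytype = "PROBEID")
# Install your chip .db package from Bioconductor first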
If you are an R user, consider the Bioconductor annotation packages; details on their use can be seen in the AnnotationDbi vignettes.
Alternatively, consider the biomaRt package and see the biomaRt user guide.
You can use BioMart:
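library("biomaRt")
ensembl <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Probe attribute for the HG U133 Plus 2 chip; run listAttributes(ensembl) to find the one for your chip
affy_ensembl <- c("affy_hg_u133_plus_2", "ensembl_gene_id")
# No filters are set, so every probe/gene pair is returned
getBM(attributes = affy_ensembl, mart = ensembl, uniqueRows = TRUE)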
The problem with converting probe IDs to Entrez or Ensembl gene IDs is that one probe ID can map to more than one Ensembl gene ID, and vice versa.
The solution is:
- Discard any probe ID that maps to more than one Ensembl gene ID
- Take the mean or max of the probe IDs that map to the same Ensembl or Entrez ID
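The two steps above can be sketched in base R as follows. This is only a minimal illustration on toy data: the probe and gene IDs, the `map` and `expr` objects, and the `collapse_probes` helper are all hypothetical names made up for the example, and "max" is taken here as the per-sample maximum across probes, which is one of several possible readings.

```r
# Toy probe-to-gene mapping (hypothetical IDs), shaped like select() or getBM() output
map <- data.frame(
  probe = c("p1", "p2", "p3", "p3", "p4"),
  gene  = c("G1", "G1", "G2", "G3", "G4"),
  stringsAsFactors = FALSE
)

# Toy expression matrix: one row per probe, one column per sample
expr <- matrix(
  c(1, 3, 5, 7,
    2, 4, 6, 8),
  nrow = 4,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2"))
)

# Step 1: drop unmapped rows and probes that map to more than one gene
map <- unique(map[!is.na(map$gene), ])
genes_per_probe <- table(map$probe)
unambiguous <- names(genes_per_probe)[genes_per_probe == 1]
map <- map[map$probe %in% unambiguous & map$probe %in% rownames(expr), ]

# Step 2: collapse the probes of each gene into one row per gene
collapse_probes <- function(expr, map, fun = mean) {
  sub <- expr[map$probe, , drop = FALSE]
  agg <- aggregate(as.data.frame(sub), by = list(gene = map$gene), FUN = fun)
  out <- as.matrix(agg[, -1, drop = FALSE])
  rownames(out) <- agg$gene
  out
}

gene_mean <- collapse_probes(expr, map, mean)  # per-sample mean across probes
gene_max  <- collapse_probes(expr, map, max)   # per-sample max across probes
```

With real data, `map` would come from the `select()` or `getBM()` call and `expr` from your normalized expression matrix. Whether mean or max is the better summary depends on the analysis, so treat the choice of `fun` as something to decide for your own data.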
Another solution is to use Brainarray's custom CDFs (I prefer this one).
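# Download and install the Brainarray custom CDF package (Ensembl gene level, version 21.0.0)
download.file("http://mbni.org/customcdf/21.0.0/ensg.download/hgu133plus2hsensgcdf_21.0.0.tar.gz", "/home/hgu133plus2hsensgcdf")
install.packages("/home/hgu133plus2hsensgcdf", repos = NULL, type = "source")
library(hgu133plus2hsensgcdf)
library(affy)
# celfilepath and celfilenames must already hold your CEL directory and file names
RawData <- ReadAffy(verbose = TRUE, celfile.path = celfilepath, cdfname = "hgu133plus2hsensgcdf", filenames = celfilenames)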