Thanks to both of you, I can propose something here for other people who might have the same problem:
Suppose you want to analyse 4 datasets A, B, C, and D, which come from 3 platforms hgug4112a, hgu133a and hgu95av2.
Due to preprocessing, the set of probes in each dataset may be smaller than the full set of probes on its platform. This is taken into account.
The script is:
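library(hgu133a.db)
library(hgug4112a.db)
library(hgu95av2.db)
# hgu133a probes with an Ensembl mapping that are present in both A and B
x = hgu133aENSEMBL
x.mapped = mappedkeys(x)
x.mapped = x.mapped[x.mapped %in% row.names(A.data) &
                    x.mapped %in% row.names(B.data)]
# hgug4112a probes with an Ensembl mapping that are present in C
y = hgug4112aENSEMBL
y.mapped = mappedkeys(y)
y.mapped = y.mapped[y.mapped %in% row.names(C.data)]
# Ensembl IDs shared by the first two platforms
m1 = intersect(toTable(x[x.mapped])$ensembl_id, toTable(y[y.mapped])$ensembl_id)
length(m1)
# hgu95av2 probes with an Ensembl mapping that are present in D
z = hgu95av2ENSEMBL
z.mapped = mappedkeys(z)
z.mapped = z.mapped[z.mapped %in% row.names(D.data)]
# Ensembl IDs shared by all three platforms
m2 = intersect(m1, toTable(z[z.mapped])$ensembl_id)
length(m2)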
x.names = unlist(as.list(x[x.mapped]))
x.probes = names(x.names[match(m2,x.names)])
length(x.probes)
y.names = unlist(as.list(y[y.mapped]))
y.probes = names(y.names[match(m2,y.names)])
length(y.probes)
z.names = unlist(as.list(z[z.mapped]))
z.probes = names(z.names[match(m2,z.names)])
length(z.probes)
Sorry, but there is a mistake in the code above.
DON'T USE THIS SYNTAX: unlist(as.list(x[x.mapped])). To avoid duplicate names, unlist() appends a number to the name of every probe that maps to several IDs, so the names no longer match the real probe IDs (toTable() does not do this):
unlist(as.list(x[x.mapped]))[1:1000]
       1007_s_at1        1007_s_at2        1007_s_at3        1007_s_at4
"ENSG00000137332" "ENSG00000204580" "ENSG00000215522" "ENSG00000230456"
       1007_s_at5           1053_at            117_at            121_at
"ENSG00000234078" "ENSG00000049541" "ENSG00000173110" "ENSG00000125618"
toTable(x[x.mapped])[1:1000,]
   probe_id      ensembl_id
1 1007_s_at ENSG00000137332
2 1007_s_at ENSG00000204580
3 1007_s_at ENSG00000215522
4 1007_s_at ENSG00000230456
5 1007_s_at ENSG00000234078
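The renaming comes from base R's unlist() itself and can be reproduced without any annotation package. A minimal sketch, where the `probe2ens` list is a made-up stand-in for as.list(x[x.mapped]):

```r
# Toy stand-in for as.list(x[x.mapped]): one probe maps to two Ensembl IDs.
probe2ens <- list(
  "1007_s_at" = c("ENSG00000137332", "ENSG00000204580"),
  "1053_at"   = "ENSG00000049541"
)

# unlist() makes names unique by appending an index to multi-element entries.
flat <- unlist(probe2ens)
names(flat)   # "1007_s_at1" "1007_s_at2" "1053_at"

# Building the long table by hand keeps the probe names intact.
tab <- data.frame(
  probe_id   = rep(names(probe2ens), lengths(probe2ens)),
  ensembl_id = unlist(probe2ens, use.names = FALSE),
  stringsAsFactors = FALSE
)
all(tab$probe_id %in% names(probe2ens))   # TRUE
all(names(flat) %in% names(probe2ens))    # FALSE
```

Only probes with a single ID keep their original name, which is why the error is easy to miss.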
So the correct code is:
x.names = toTable(x[x.mapped])
x.probes = x.names$probe_id[match(m2,x.names$ensembl_id)]
length(x.probes)
y.names = toTable(y[y.mapped])
y.probes = y.names$probe_id[match(m2,y.names$ensembl_id)]
length(y.probes)
z.names = toTable(z[z.mapped])
z.probes = z.names$probe_id[match(m2,z.names$ensembl_id)]
length(z.probes)
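Note that match() returns only the first hit for each element of m2, so x.probes, y.probes and z.probes each hold exactly one probe per shared Ensembl ID, in the order of m2. A small sketch on a made-up table (the `tab` data frame and its IDs are hypothetical, shaped like toTable() output):

```r
# Hypothetical probe-to-gene table; p2 and p3 both map to ENSG_B.
tab <- data.frame(
  probe_id   = c("p1", "p2", "p3", "p4"),
  ensembl_id = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  stringsAsFactors = FALSE
)
shared <- c("ENSG_B", "ENSG_C")   # plays the role of m2

# match() gives the position of the first occurrence only.
first.probes <- tab$probe_id[match(shared, tab$ensembl_id)]
first.probes   # "p2" "p4": p3 also maps to ENSG_B but is silently dropped

# All probes for the shared IDs, to see what was skipped.
all.probes <- tab$probe_id[tab$ensembl_id %in% shared]
all.probes     # "p2" "p3" "p4"
```

So the three lengths printed above are all equal to length(m2), whatever the number of probes per gene.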
Then you can construct a reduced eset like this:
library(Biobase)
# Select rows by probe name so that every dataset follows the order of m2
A.exprs = as.matrix(A.data[y.probes, 1:9])
dimnames(A.exprs) = list(y.probes, row.names(A.pData))
A.eset = new("ExpressionSet", exprs = A.exprs,
             phenoData = A.phenoData, experimentData = A.expData,
             annotation = "hgug4112a")
B.exprs = as.matrix(B.data[x.probes, 7:15])
dimnames(B.exprs) = list(x.probes, row.names(B.pData))
B.eset = new("ExpressionSet", exprs = B.exprs,
             phenoData = B.phenoData, experimentData = B.expData,
             annotation = "hgu133a")
etc.
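One way to make sure the reduced matrices line up gene by gene is to select rows by probe name and to check first that every probe is really present. A minimal base R sketch with made-up data (`dat`, `probes` and the sample names are hypothetical; Biobase is not needed for this part):

```r
# Hypothetical expression table: rows are probes, columns are samples.
dat <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6),
                  row.names = c("p1", "p2", "p3"))
probes <- c("p3", "p1")   # one probe per shared gene, in the order of m2

# Fail early: indexing a data.frame with an unknown row name returns a row of NAs.
stopifnot(all(probes %in% row.names(dat)))

# Rows come back in the order of 'probes', not in the order of 'dat'.
mat <- as.matrix(dat[probes, ])
rownames(mat)     # "p3" "p1"
mat["p3", "s2"]   # 6
```

The resulting matrix can then be passed as the exprs argument when building the ExpressionSet.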
This works fine, but it raises another problem that I can't answer yet: how to choose between several probes for the same ID. For example, in Jeremy Leipzig's example, we get a data.frame like this:
merge(toTable(x[x.mapped]),toTable(y[y.mapped]),by="ensembl_id")[1:11,]
        ensembl_id  probe_id.x   probe_id.y
1  ENSG00000000419   202673_at  A_23_P68472
2  ENSG00000000457    41329_at  A_23_P74320
3  ENSG00000000457 205607_s_at  A_23_P74320
4  ENSG00000000460 220840_s_at  A_23_P11862
5  ENSG00000000938 208438_s_at  A_24_P51517
6  ENSG00000000938 208438_s_at A_23_P103932
7  ENSG00000000938 208438_s_at A_24_P316634
8  ENSG00000001084   202922_at A_23_P352879
9  ENSG00000001084   202922_at A_23_P145114
10 ENSG00000001084 202923_s_at A_23_P352879
11 ENSG00000001084 202923_s_at A_23_P145114
The resulting data.frame has dimensions 35006 x 3, which is larger than the original tables (x: 21363 x 2, y: 22712 x 2), because every pairing of mapped probes is returned (see rows 5 to 7, and 8 to 11).
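To see where the extra rows come from: merge() returns every pairing of probes that share an ID, so a gene with 2 probes on each platform gives 4 rows. The sketch below reproduces this on made-up tables, then shows one common heuristic for keeping a single probe per gene (the probe with the highest mean expression). That heuristic is only an assumption on my side, not an established answer to the question; `expr.x` and all IDs are hypothetical.

```r
# Hypothetical toTable()-like tables for two platforms.
tx <- data.frame(probe_id = c("x1", "x2", "x3"),
                 ensembl_id = c("G1", "G1", "G2"), stringsAsFactors = FALSE)
ty <- data.frame(probe_id = c("y1", "y2", "y3"),
                 ensembl_id = c("G1", "G1", "G2"), stringsAsFactors = FALSE)

m <- merge(tx, ty, by = "ensembl_id")
nrow(m)   # 5: 2 x 2 pairings for G1, plus 1 for G2

# Hypothetical expression matrix for platform x (rows = probes, columns = samples).
expr.x <- matrix(c(5, 9, 7,
                   6, 8, 7), nrow = 3,
                 dimnames = list(c("x1", "x2", "x3"), c("s1", "s2")))

# Mean expression per probe, then the top probe within each gene.
avg <- rowMeans(expr.x)
best.x <- sapply(split(tx$probe_id, tx$ensembl_id),
                 function(p) p[which.max(avg[p])])
best.x   # G1 -> "x2", G2 -> "x3"
```

Other criteria (highest variance, or averaging the probes of a gene) would fit in the same place as which.max().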
I don't know which slot of the annotation object is best for the mapping. I chose ensembl_id, but one could have taken SYMBOL or something else. I think CHRLOC is too precise, but SYMBOL might exclude unannotated genes? Do you know whether there is a slot in the annotation object that maps probes to an exon? That might help to match probesets to one another.
Thanks again for the help.
Julien